Abstract

We consider a class of vector autoregressive models with banded coefficient matrices. The
setting represents a type of sparse structure for high-dimensional time series, though the implied
autocovariance matrices are not banded. The structure is also practically meaningful when the
order of component time series is arranged appropriately. The convergence rates for the estimated
banded autoregressive coefficient matrices are established. We also propose a Bayesian
information criterion for determining the width of the bands in the coefficient matrices, which
is proved to be consistent. By exploring some approximate banded structure for the auto-covariance
functions of banded vector autoregressive processes, consistent estimators for the
auto-covariance matrices are constructed.

Keywords: Banded auto-coefficient matrices; BIC; Frobenius norm; Vector autoregressive model. 


1 Introduction 

The demand for modelling and forecasting high-dimensional time series arises from panel studies of 
economic, social and natural phenomena, financial market analysis, communication engineering and 
other domains. When the dimension of time series is even moderately large, statistical modelling is 
challenging, as vector autoregressive and moving average models suffer from lack of identification, 
over-parameterization and flat likelihood functions. While pure vector autoregressive models are 
perfectly identifiable, their usefulness is often hampered by the lack of proper means of reducing the 
number of parameters. 




In many practical situations it is enough to collect the information from neighbour variables,
though the definition of neighbourhoods is case-dependent. For example, sales, prices, weather indices
or electricity consumption influenced by temperature depend on those at nearby locations, in
the sense that the information from farther locations may become redundant given that from neighbours.
See, for example, Can and Megbolugbe (1997) for a house price example which exhibits such
a dependence structure. In this paper, we propose a class of vector autoregressive models to cater
for such dynamic structures. We assume that the autoregressive coefficient matrices are banded, i.e.,
non-zero coefficients form a narrow band along the main diagonal. The setting specifies explicit
autoregression over neighbour component series only. Nevertheless, non-zero cross correlations
among all component series may still exist, as the implied auto-covariance matrices are not banded.
This is an effective way to impose sparse structure, as the number of parameters in each autoregressive
coefficient matrix is reduced from $p^2$ to $O(p)$, where $p$ denotes the number of time series.
In practice, a banded structure may be employed by arranging the order of component series appropriately.
The ordering can be deduced from subject knowledge aided by statistical tools such as the
Bayesian information criterion; see Section 5.2. With the imposed banded structure, we propose
least squares estimators for the autoregressive coefficient matrices which attain the convergence
rate $(p/n)^{1/2}$ under the Frobenius norm and $(\log p/n)^{1/2}$ under the spectral norm when $p$ diverges
together with the length $n$ of time series.

In practice the maximum width of the non-zero coefficient bands in the coefficient matrices,
which is called the bandwidth, is unknown. We propose a marginal Bayesian information criterion
to identify the true bandwidth. It is shown that this criterion leads to consistent bandwidth
determination when both $n$ and $p$ tend to infinity.

We also address the estimation of the autocovariance functions for high-dimensional banded autoregressive
models. Although the autocovariance matrices of a banded process are unlikely to be
banded, they admit some asymptotic banded approximations when the covariance of innovations is
banded. Because of this property, the band-truncated sample autocovariance matrices are consistent
estimators with the convergence rate $\log(n/\log p)(\log p/n)^{1/2}$, which is faster than that for the
standard banding covariance estimators (Bickel and Levina, 2008). See also Wu and Pourahmadi
(2009), Bickel and Gel (2011) and Leng and Li (2011) for the estimation of the banded covariance
matrices of time series.

Most existing work on high-dimensional autoregressive models draws inspiration from recent
developments in high-dimensional regression. For example, Hsu et al. (2008) proposed lasso penalization
for subset autoregression. Haufe et al. (2010) introduced the group sparsity for coefficient
matrices and advocated use of group lasso penalization. A truncated weighted lasso and group
lasso penalization approaches were proposed by Shojaie and Michailidis (2010) and Basu et al.
(2015), respectively, to explore graphical Granger causality. Basu and Michailidis (2015) focused
on stable Gaussian processes and investigated the theoretical properties of $L_1$-regularized estimates
of the transition matrix in sparse autoregressive models. Bolstad et al. (2011) inferred sparse
causal networks through vector autoregressive processes and proposed a group lasso procedure.
Kock and Callot (2015) established oracle inequalities for high-dimensional vector autoregressive
models. Han and Liu (2015) proposed an alternative Dantzig-type penalization and formulated the
estimation problem into a linear program. Chen et al. (2013) studied sparse covariance and precision
matrices in high-dimensional time series under a general dependence structure.
2 Methodology

2.1 Banded vector autoregressive models

Let $y_t$ be a $p \times 1$ time series defined by

$$y_t = A_1 y_{t-1} + \cdots + A_d y_{t-d} + \varepsilon_t, \qquad (1)$$

where $\varepsilon_t$ is the innovation at time $t$, $E(\varepsilon_t) = 0$ and $\mathrm{var}(\varepsilon_t) = E(\varepsilon_t \varepsilon_t^T) = \Sigma_\varepsilon$, and $\varepsilon_t$ is independent
of $y_{t-1}, y_{t-2}, \ldots$. Furthermore, all the coefficient matrices $A_1, \ldots, A_d$ are banded in the sense that

$$a_{ij}^{(\ell)} = 0, \quad |i - j| > k_0, \quad \ell = 1, \ldots, d, \qquad (2)$$

where $a_{ij}^{(\ell)}$ denotes the $(i,j)$-th element of $A_\ell$. Thus the maximum number of non-zero elements in
each row of $A_\ell$ is the bandwidth $2k_0 + 1$, and $k_0$ is called the bandwidth parameter. We assume that
$k_0 \ge 0$ and $d \ge 1$ are fixed integers, and $p \gg k_0, d$. Our goal is to determine $k_0$ and to estimate the
banded coefficient matrices $A_1, \ldots, A_d$. For simplicity, we assume that the autoregressive order $d$ is
known, as the order-determination problem has already been thoroughly studied; see, e.g., Chapter
4 of Lütkepohl (2007).


Under the condition $\det(I_p - A_1 z - \cdots - A_d z^d) \ne 0$ for any $|z| \le 1$, model (1) admits a weakly
stationary solution $\{y_t\}$, where $I_p$ denotes the $p \times p$ identity matrix. Throughout this paper, $y_t$ refers
to this stationary process. If, in addition, $\varepsilon_t$ is independent and identically distributed, $y_t$ is also
strictly stationary.

In model (1), we do not require $\mathrm{var}(\varepsilon_t) = \Sigma_\varepsilon$ to be banded, but even if it is, the autocovariance
matrices are not necessarily banded; see (12) below. Therefore, the proposed banded model is applicable
when the linear dynamics of each component series depend predominantly on its neighbour
series, though there may be non-zero correlations among all component series of $y_t$.


2.2 Estimating banded autoregressive coefficient matrices 

Since each row of $A_\ell$ has at most $2k_0 + 1$ non-zero elements, there are at most $(2k_0 + 1)d$ regressors
in each row on the right-hand side of (1). For $i = 1, \ldots, p$, let $\beta_i$ be the column vector obtained by
stacking the non-zero elements in the $i$-th rows of $A_1, \ldots, A_d$ together. Let $\tau_i$ denote the length of
$\beta_i$. Then

$$\tau_i = \begin{cases} (2k_0 + 1)d, & i = k_0 + 1, k_0 + 2, \ldots, p - k_0, \\ (2k_0 + 1 - j)d, & i = k_0 + 1 - j \ \text{or} \ p - k_0 + j, \quad j = 1, \ldots, k_0. \end{cases} \qquad (3)$$
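As a quick check of the row lengths in (3), the sketch below counts, for each row $i$, the columns $j$ with $|i - j| \le k_0$ and multiplies by $d$. The function name `tau` and the toy values of `p`, `k0` and `d` are illustrative assumptions, not part of the paper.

```python
def tau(i, p, k0, d):
    """Length of beta_i for row i (1-indexed): number of band columns times d."""
    lo = max(1, i - k0)   # first column inside the band
    hi = min(p, i + k0)   # last column inside the band
    return (hi - lo + 1) * d

# Toy values (assumed for illustration only).
p, k0, d = 10, 2, 1
lengths = [tau(i, p, k0, d) for i in range(1, p + 1)]
print(lengths)  # [3, 4, 5, 5, 5, 5, 5, 5, 4, 3]
```

Interior rows carry the full $(2k_0 + 1)d$ regressors, while the first and last $k_0$ rows lose one regressor per step towards the boundary, which is the second case of (3).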


Now (1) can be written as

$$y_{i,t} = x_{i,t}^T \beta_i + \varepsilon_{i,t}, \quad i = 1, \ldots, p, \qquad (4)$$

where $y_{i,t}$, $\varepsilon_{i,t}$ are respectively the $i$-th component of $y_t$ and $\varepsilon_t$, and $x_{i,t}$ is the $\tau_i \times 1$ vector consisting
of the corresponding components of $y_{t-1}, \ldots, y_{t-d}$. Consequently, the least squares estimator of
$\beta_i$ based on (4) is

$$\hat{\beta}_i = (X_i^T X_i)^{-1} X_i^T y_{(i)}, \qquad (5)$$

where $y_{(i)} = (y_{i,d+1}, \ldots, y_{i,n})^T$, and $X_i$ is an $(n - d) \times \tau_i$ matrix with $x_{i,d+j}^T$ as its $j$-th row.

By estimating $\beta_i$, $i = 1, \ldots, p$, separately based on (4), we obtain the least squares estimators
$\hat{A}_1, \ldots, \hat{A}_d$ for the coefficient matrices in (1). Furthermore, the resulting residual sum of squares is

$$\mathrm{RSS}_i = \mathrm{RSS}_i(k_0) = y_{(i)}^T \{ I_{n-d} - X_i (X_i^T X_i)^{-1} X_i^T \} y_{(i)}. \qquad (6)$$

We write this as a function of $k_0$ to stress that the above estimation presupposes that the bandwidth
is $(2k_0 + 1)$ in the sense of (2).
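The row-by-row estimation described above can be sketched as follows for the special case $d = 1$. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function name `fit_banded_var1`, the toy coefficient matrix and the sample sizes are all hypothetical, and NumPy's `lstsq` is used in place of the explicit normal equations in (5).

```python
import numpy as np

def fit_banded_var1(y, k):
    """Row-wise least squares for y_t = A y_{t-1} + e_t with band |i - j| <= k.

    y has shape (n, p). Returns (A_hat, rss), where rss[i] plays the role of RSS_i(k).
    """
    n, p = y.shape
    A_hat = np.zeros((p, p))
    rss = np.zeros(p)
    for i in range(p):
        cols = np.arange(max(0, i - k), min(p, i + k + 1))  # neighbours of series i
        X = y[:-1, cols]                                     # (n - 1) x tau_i design matrix
        target = y[1:, i]                                    # y_{i,2}, ..., y_{i,n}
        beta, *_ = np.linalg.lstsq(X, target, rcond=None)    # least squares fit of row i
        A_hat[i, cols] = beta
        resid = target - X @ beta
        rss[i] = resid @ resid
    return A_hat, rss

# Toy data (assumed values): tridiagonal A, i.e. k0 = 1, with N(0, I_p) innovations.
rng = np.random.default_rng(0)
p, n = 6, 400
A = 0.3 * np.eye(p) + 0.2 * (np.eye(p, k=1) + np.eye(p, k=-1))
y = np.zeros((n, p))
for t in range(1, n):
    y[t] = A @ y[t - 1] + rng.standard_normal(p)

A_hat, rss = fit_banded_var1(y, k=1)
print(np.round(A_hat, 2))
```

Each row is fitted with at most $2k + 1$ regressors, so the $p$ regressions stay small even when $p$ is large; entries outside the band are left at exactly zero.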

2.3 Determination of bandwidth 

In practice the bandwidth is unknown and we need to estimate $k_0$. We propose to determine $k_0$ based
on the marginal Bayesian information criterion,

$$\mathrm{BIC}_i(k) = \log \mathrm{RSS}_i(k) + \frac{1}{n} \tau_i(k) C_n \log(p \vee n), \quad i = 1, \ldots, p, \qquad (7)$$

where $\mathrm{RSS}_i(k)$ and $\tau_i(k)$ are defined, respectively, in (6) and (3), $p \vee n = \max(p, n)$, and $C_n > 0$
is some constant which diverges together with $n$; see Condition 2. We often take $C_n$ to be $\log \log n$.
An estimator for $k_0$ is

$$\hat{k} = \max_{1 \le i \le p} \Big\{ \arg\min_{1 \le k \le K} \mathrm{BIC}_i(k) \Big\}, \qquad (8)$$

where $K \ge 1$ is a prescribed integer. Our numerical study shows that the procedure is insensitive
to the choice of $K$ provided $K \ge k_0$. In practice, we often take $K$ to be $[n^{1/2}]$ or choose $K$ by
checking the curvature of $\mathrm{BIC}_i(k)$ directly.
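The selection rule in (7) and (8) can be sketched for $d = 1$ as below. The helper names `rss_row` and `select_bandwidth`, the simulated data and the choice $C_n = \log\log n$ are assumptions made for illustration; the sketch is not the authors' code.

```python
import numpy as np

def rss_row(y, i, k):
    """Residual sum of squares and regressor count for row i under band |i - j| <= k."""
    n, p = y.shape
    cols = np.arange(max(0, i - k), min(p, i + k + 1))
    X, target = y[:-1, cols], y[1:, i]
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ beta
    return resid @ resid, len(cols)

def select_bandwidth(y, K):
    """Marginal BIC: minimise BIC_i(k) over k for each row, then take the maximum."""
    n, p = y.shape
    penalty = np.log(np.log(n)) * np.log(max(p, n)) / n   # C_n log(p v n) / n
    k_rows = []
    for i in range(p):
        bic = []
        for k in range(1, K + 1):
            rss, tau = rss_row(y, i, k)
            bic.append(np.log(rss) + tau * penalty)
        k_rows.append(1 + int(np.argmin(bic)))            # row-wise minimiser
    return max(k_rows), k_rows

# Toy data (assumed values): tridiagonal A, so the true bandwidth parameter is 1.
rng = np.random.default_rng(1)
p, n = 8, 300
A = 0.3 * np.eye(p) + 0.25 * (np.eye(p, k=1) + np.eye(p, k=-1))
y = np.zeros((n, p))
for t in range(1, n):
    y[t] = A @ y[t - 1] + rng.standard_normal(p)

k_hat, k_rows = select_bandwidth(y, K=4)
print(k_hat, k_rows)
```

Taking the maximum over rows reflects that the band must be wide enough for every component series, so a single row with a strong coefficient at lag position $k$ is enough to push the estimate up to $k$.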

Remark 1. If the order $d$ is unknown, we can modify the criterion in (8) as follows. Let $\mathrm{RSS}_i(k, \ell)$
and $\tau_i(k, \ell)$ be defined similarly to (6) and (3). The marginal Bayesian information criterion is

$$\mathrm{BIC}_i(k, \ell) = \log \mathrm{RSS}_i(k, \ell) + \frac{1}{n} \tau_i(k, \ell) C_n \log(p \vee n), \quad i = 1, \ldots, p. \qquad (9)$$

Let $L$ be a prescribed integer upper bound on $d$, often taken to be 10 or $[n^{1/2}]$. Let

$$(\hat{k}_i, \hat{d}_i) = \arg\min_{1 \le k \le K,\ 1 \le \ell \le L} \mathrm{BIC}_i(k, \ell), \quad i = 1, \ldots, p,$$

and $\hat{k} = \max_{1 \le i \le p} \hat{k}_i$ and $\hat{d} = \max_{1 \le i \le p} \hat{d}_i$. Proposition 1 in the Supplementary Material shows that
under Conditions 1-4 in Section 3.1, $\mathrm{pr}(\hat{k} = k_0, \hat{d} = d) \to 1$ as $n$ and $p \to \infty$.

Remark 2. The banded structure of the coefficient matrices $A_1, \ldots, A_d$ depends on the order of
the component series of $y_t$. In principle it is possible to derive a complete data-driven method to
deduce the optimal ordering which minimizes the bandwidth, but such a procedure is computationally
burdensome for large $p$. For most applications meaningful orderings are suggested by practical
consideration. We can then calculate

$$\mathrm{BIC} = \sum_{i=1}^{p} \mathrm{BIC}_i(\hat{k}) \qquad (10)$$

for each suggested ordering, and choose the ordering which minimizes (10). In expression (10),
$\mathrm{BIC}_i(\cdot)$ and $\hat{k}$ are defined as in (7) and (8). Two real data examples in Section 5.2 indicate that this
scheme works well in applications.
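3 Asymptotic properties

3.1 Regularity conditions

For a vector $v = (v_1, \ldots, v_j)^T$ and a matrix $B = (b_{ij})$, let

$$\|v\|_q = \Big( \sum_{i=1}^{j} |v_i|^q \Big)^{1/q}, \quad \|v\|_\infty = \max_{1 \le i \le j} |v_i|, \quad \|B\|_q = \max_{\|v\|_q = 1} \|Bv\|_q, \quad \|B\|_F = \Big( \sum_{i,j} b_{ij}^2 \Big)^{1/2},$$

i.e., $\| \cdot \|_q$ denotes the $L_q$ norm of a vector or matrix, and $\| \cdot \|_F$ denotes the Frobenius norm of a
matrix.

First we note that the model (1) can be formulated as

$$\tilde{y}_t = \tilde{A} \tilde{y}_{t-1} + \tilde{\varepsilon}_t,$$

where

$$\tilde{y}_t = \begin{pmatrix} y_t \\ y_{t-1} \\ \vdots \\ y_{t-d+1} \end{pmatrix}, \quad \tilde{A} = \begin{pmatrix} A_1 & A_2 & \cdots & A_{d-1} & A_d \\ I_p & 0 & \cdots & 0 & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & I_p & 0 \end{pmatrix}, \quad \tilde{\varepsilon}_t = \begin{pmatrix} \varepsilon_t \\ 0_{p \times 1} \\ \vdots \\ 0_{p \times 1} \end{pmatrix}. \qquad (11)$$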

Now we list the regularity conditions required for our asymptotic results. 

Condition 1. For $\tilde{A}$ defined in (11), $\|\tilde{A}\|_2 \le C$ and $\|\tilde{A}^{j_0}\|_2 \le \delta^{j_0}$, where $C > 0$, $\delta \in (0,1)$ and
$j_0 \ge 1$ are constants free of $n$ and $p$, and $j_0$ is an integer.

Condition 1'. For $\tilde{A}$ defined in (11), $\|\tilde{A}^{j_0}\|_2 \le \delta^{j_0}$, $\|\tilde{A}\|_\infty \le C$ and $\|\tilde{A}^{j_0}\|_\infty \le \delta^{j_0}$, where $C > 0$,
$\delta \in (0,1)$ and $j_0 \ge 1$ are constants free of $n$ and $p$, and $j_0$ is an integer.

Condition 2. Let $a_{ij}^{(\ell)}$ be the $(i,j)$-th element of $A_\ell$. For each $i = 1, \ldots, p$, $|a_{i,i+k_0}^{(\ell)}|$ or $|a_{i,i-k_0}^{(\ell)}|$ is
greater than $\{C_n k_0 n^{-1} \log(p \vee n)\}^{1/2}$ for some $1 \le \ell \le d$, where $C_n \to \infty$ as $n \to \infty$.

Condition 3. The minimal eigenvalue $\lambda_{\min}\{\mathrm{cov}(y_t)\} \ge \kappa_1$ and $\max_{1 \le i \le p} \sigma_{ii} \le \kappa_2$ for some positive
constants $\kappa_1$ and $\kappa_2$ free of $p$, where $\sigma_{ii}$ is the $i$-th diagonal element of $\mathrm{cov}(y_t)$, and $\lambda_{\min}(\cdot)$
denotes the minimum eigenvalue.

Condition 4. The innovation process $\{\varepsilon_t, t = 0, \pm 1, \pm 2, \ldots\}$ is independent and identically distributed
with zero mean and covariance $\Sigma_\varepsilon$. Furthermore, one of the two assertions holds:

(i) $\max_{1 \le i \le p} E(|\varepsilon_{i,t}|^{2q}) \le C$ and $p = O(n^\beta)$, where $q > 2$, $\beta \in (0, (q-2)/4)$ and
$C > 0$ are some constants free of $n$ and $p$;

(ii) $\max_{1 \le i \le p} E\{\exp(\lambda_0 |\varepsilon_{i,t}|^{2\alpha})\} \le C$ and $\log p = o\{n^{\alpha/(2-\alpha)}\}$, where $\lambda_0 > 0$,
$\alpha \in (0,1]$ and $C > 0$ are constants free of $n$ and $p$.

Provided $\{\varepsilon_t\}$ is independent and identically distributed, Condition 1 implies that $y_t$ is strictly
stationary and that for any $j \ge 1$, $\|\tilde{A}^j\|_2 \le C\delta^j$ with some constant $C > 0$ and $\delta \in (0,1)$.
The independent and identically distributed assumption in Condition 4 is imposed to simplify the
proofs but is not essential. Condition 2 ensures that the bandwidth $(2k_0 + 1)$ is asymptotically
identifiable, as $\{n^{-1} \log(p \vee n)\}^{1/2}$ is the minimum order of a non-zero coefficient to be identifiable;
see, e.g., Luo and Chen (2013). Condition 3 guarantees that the covariance matrix $\mathrm{var}(y_t)$ is strictly
positive definite. Condition 4 specifies the two asymptotic modes: (i) high-dimensional cases with
$p = O(n^\beta)$, and (ii) ultra high-dimensional cases with $\log p = o\{n^{\alpha/(2-\alpha)}\}$.
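To make the companion form (11) and the decay requirement on powers of $\tilde{A}$ concrete, the sketch below stacks two small coefficient matrices and inspects the spectral norms of successive powers. The function name `companion` and the toy matrices `A1`, `A2` are hypothetical choices for illustration only.

```python
import numpy as np

def companion(A_list):
    """Stack A_1, ..., A_d into the (pd) x (pd) companion matrix of (11)."""
    d = len(A_list)
    p = A_list[0].shape[0]
    top = np.hstack(A_list)                       # first block row: A_1 ... A_d
    if d == 1:
        return top
    bottom = np.hstack([np.eye(p * (d - 1)),      # shifted identity blocks
                        np.zeros((p * (d - 1), p))])
    return np.vstack([top, bottom])

# Toy banded VAR(2) (assumed values): tridiagonal A_1 and diagonal A_2.
p = 3
A1 = 0.3 * np.eye(p) + 0.1 * (np.eye(p, k=1) + np.eye(p, k=-1))
A2 = 0.2 * np.eye(p)
A_tilde = companion([A1, A2])

rho = max(abs(np.linalg.eigvals(A_tilde)))        # spectral radius
norms = [np.linalg.norm(np.linalg.matrix_power(A_tilde, j), 2) for j in (1, 5, 10, 50)]
print(round(float(rho), 3), [float(np.round(v, 4)) for v in norms])
```

The spectral norm of $\tilde{A}$ itself may exceed one because of the identity blocks, which is why the condition is stated for a power $\tilde{A}^{j_0}$ rather than for $\tilde{A}$.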
3.2 Asymptotic theorems

We first state the consistency of the selector $\hat{k}$, defined in (8), for determining the bandwidth parameter
$k_0$.

Theorem 1. Under Conditions 1-4, $\mathrm{pr}(\hat{k} = k_0) \to 1$ as $n \to \infty$.

Remark 3. In Theorem 1, $k_0$ is assumed to be fixed, as in applications small $k_0$ is of particular interest.
But we can allow the bandwidth parameter $k_0$ to diverge as $n, p \to \infty$. To show its consistency,
the regularity conditions would need to be strengthened. To be specific, if $k_0 \ll C_n^{-1} n / \log(p \vee n)$,
$\mathrm{pr}(\hat{k} = k_0) \to 1$ as $n \to \infty$ under Conditions 1' and 2-4 in Section 3.1; see the Supplementary
Material.

Since $k_0$ is unknown, we replace it by $\hat{k}$ in the estimation procedure for $A_1, \ldots, A_d$ described
in Section 2.2 and still denote the resulting estimators by $\hat{A}_1, \ldots, \hat{A}_d$. Theorem 2 addresses their
convergence rates.

Theorem 2. Let Conditions 1-4 hold. As $n \to \infty$, it holds for $j = 1, \ldots, d$ that

$$\|\hat{A}_j - A_j\|_F = O_p\{(p/n)^{1/2}\}, \quad \|\hat{A}_j - A_j\|_2 = O_p\{(\log p/n)^{1/2}\}.$$

Conditions 4(i) and 4(ii) impose, respectively, a high moment condition and an exponential tail
condition on the innovation distribution. Although the convergence rates in Theorem 2 have the
same expressions in terms of $n$ and $p$, due to the different conditions imposed on them in Conditions
4(i) and 4(ii), the actual convergence rates are different under the two settings. For example,
Condition 4(i) allows $p$ to grow in the order $n^\beta$, which implies the convergence rate $(\log n/n)^{1/2}$
for $\hat{A}_j$ under the spectral norm. On the other hand, Condition 4(ii) may allow $p$ to diverge at the
rate $\exp\{n^{\alpha/(2-\alpha) - 2\epsilon}\}$ for a small constant $\epsilon > 0$, and the implied convergence rate for $\hat{A}_j$ under the
spectral norm is $n^{-\{1/2 + \epsilon - \alpha/(4 - 2\alpha)\}}$.


4 Estimation for auto-covariance functions 

For the banded vector autoregressive process $y_t$ defined by (1), the auto-covariance function $\Sigma_j =
\mathrm{cov}(y_t, y_{t+j})$ is unlikely to be banded. For example, for a stationary banded autoregressive process
with order 1, it can be shown that

$$\Sigma_0 = \mathrm{var}(y_t) = \Sigma_\varepsilon + \sum_{i=1}^{\infty} A_1^i \Sigma_\varepsilon (A_1^T)^i. \qquad (12)$$

For any banded matrices $B_1$ and $B_2$ with bandwidths $2k_1 + 1$ and $2k_2 + 1$, respectively, the product
$B_1 B_2$ is a banded matrix with the enlarged bandwidth $2(k_1 + k_2) + 1$ in general. Thus $\Sigma_0$ presented
in (12) is not a banded matrix. Nevertheless if $\mathrm{var}(\varepsilon_t) = \Sigma_\varepsilon$ is also banded, Theorem 3 shows that
$\Sigma_j$ can be approximated by some banded matrices.
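The growth of the bandwidth under multiplication, which is what prevents the infinite sum in (12) from being banded, can be checked on small 0/1 matrices. The helpers below (`banded_ones`, `matmul`, `half_bandwidth`) are hypothetical names introduced only for this sketch; all entries are non-negative, so no cancellation can hide a non-zero product entry.

```python
def banded_ones(p, k):
    """p x p matrix with ones where |i - j| <= k and zeros elsewhere."""
    return [[1 if abs(i - j) <= k else 0 for j in range(p)] for i in range(p)]

def matmul(A, B):
    """Plain matrix product of two lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def half_bandwidth(M):
    """Largest |i - j| over the non-zero entries of M."""
    return max((abs(i - j) for i, row in enumerate(M)
                for j, v in enumerate(row) if v != 0), default=0)

p = 12
B1, B2 = banded_ones(p, 1), banded_ones(p, 2)
print(half_bandwidth(matmul(B1, B2)))  # 3, i.e. bandwidth 2(1 + 2) + 1 = 7

# Successive powers of a tridiagonal matrix widen by one on each side.
power, widths = B1, []
for _ in range(4):
    widths.append(half_bandwidth(power))
    power = matmul(power, B1)
print(widths)  # [1, 2, 3, 4]
```

Since the $i$-th term of (12) involves $A_1^i$ on both sides, its band keeps widening with $i$, while its size shrinks geometrically under Condition 1; this is the trade-off that a banded approximation exploits.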

Condition 5. The matrix $\Sigma_\varepsilon$ is banded with bandwidth $2s_0 + 1$ and $\|\Sigma_\varepsilon\|_1 \le C < \infty$, where $C, s_0 > 0$
are constants independent of $p$, and $s_0$ is an integer.

Theorem 3. Let Conditions 1 and 5 hold. For any integers $r, j \ge 0$, there exists a banded matrix
$\Sigma_j^{(r)}$ with bandwidth $2\{(2r + j)k_0 + s_0\} + 1$ such that

$$\|\Sigma_j^{(r)} - \Sigma_j\|_2 \le C_1 \delta^{2(r+j)+1}, \quad \|\Sigma_j^{(r)} - \Sigma_j\|_1 \le C_2 r \delta^{2(r+j)+1},$$

where $C_1$ and $C_2$ are positive constants independent of $r$ and $p$, and $\delta \in (0,1)$ is specified in
Condition 1.


Under Condition 5, $\Sigma_0^{(r)} = \Sigma_\varepsilon + \sum_{1 \le i \le r} A_1^i \Sigma_\varepsilon (A_1^T)^i$ is a banded matrix with bandwidth $2(2rk_0 +
s_0) + 1$. Theorem 3 ensures that the norms of the difference $\Sigma_0 - \Sigma_0^{(r)} = \sum_{i > r} A_1^i \Sigma_\varepsilon (A_1^T)^i$ admit the
required upper bounds. Theorem 3 also paves the way for estimating $\Sigma_j$ using the banding method
of Bickel and Levina (2008), as $\Sigma_j$ can be approximated by a banded matrix with a bounded error
and thus may be effectively treated as a banded matrix. To this end, we define the banding operator
as follows: for any matrix $H = (h_{ij})$, $B_r(H) = \{h_{ij} I(|i - j| \le r)\}$. Then the banding estimator
for $\Sigma_j$ is defined as

$$\hat{\Sigma}_j^{(r_n)} = B_{r_n}(\hat{\Sigma}_j), \quad \hat{\Sigma}_j = \frac{1}{n} \sum_{t=1}^{n-j} (y_t - \bar{y})(y_{t+j} - \bar{y})^T, \quad \bar{y} = \frac{1}{n} \sum_{t=1}^{n} y_t, \qquad (13)$$

where $r_n = C \log(n/\log p)$, and $C > 0$ is a constant greater than $(-4 \log \delta)^{-1}$. Theorem 4 presents
the convergence rates for $\hat{\Sigma}_j^{(r_n)}$, which are faster than those in Bickel and Levina (2008), due to the
approximate banded structure in Theorem 3.
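The banding step can be sketched as below. The sketch assumes the usual sample autocovariance with a $1/n$ normalisation and sample-mean centring; the function names `sample_autocov` and `band`, and the white-noise toy data, are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def sample_autocov(y, j):
    """Lag-j sample autocovariance (1/n) * sum_t (y_t - ybar)(y_{t+j} - ybar)^T."""
    n = y.shape[0]
    yc = y - y.mean(axis=0)
    return yc[: n - j].T @ yc[j:] / n

def band(H, r):
    """Banding operator B_r(H): keep h_ij when |i - j| <= r, set the rest to zero."""
    idx = np.arange(H.shape[0])
    mask = np.abs(idx[:, None] - idx[None, :]) <= r
    return np.where(mask, H, 0.0)

# Toy data (assumed): white noise, only used to exercise the two functions.
rng = np.random.default_rng(2)
y = rng.standard_normal((200, 5))
sigma1_banded = band(sample_autocov(y, 1), r=1)
print(np.round(sigma1_banded, 2))
```

Entries further than $r$ from the main diagonal are discarded, so only $O(rp)$ entries of the $p \times p$ sample autocovariance are retained.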


Theorem 4. Assume that Conditions 1-5 hold. Then for any integer $j \ge 0$, as $n, p \to \infty$,

$$\|\hat{\Sigma}_j^{(r_n)} - \Sigma_j\|_2 = O_p\{ r_n (n^{-1} \log p)^{1/2} + \delta^{2(r_n + j) + 1} \} = O_p\{ \log(n/\log p)(n^{-1} \log p)^{1/2} \},$$

and

$$\|\hat{\Sigma}_j^{(r_n)} - \Sigma_j\|_1 = O_p\{ \log(n/\log p)(n^{-1} \log p)^{1/2} \}.$$

In practice we need to specify $r_n$. An ideal selection would be $r_n = \arg\min_r R_j(r)$, where

$$R_j(r) = E \| B_r(\hat{\Sigma}_j) - \Sigma_j \|_1,$$

but in practice this is unavailable because $\Sigma_j$ is unknown. We replace it by an estimator obtained
via a wild bootstrap. To this end, let $u_1, \ldots, u_n$ be independent and identically distributed with
$E(u_t) = \mathrm{var}(u_t) = 1$. A bootstrap estimator for $\Sigma_j$ is defined as

$$\Sigma_j^* = \frac{1}{n} \sum_{t=1}^{n-j} u_t (y_t - \bar{y})(y_{t+j} - \bar{y})^T.$$


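For example, we may draw $u_t$ from the standard exponential distribution. Consequently the bootstrap
estimator for $R_j(r)$ is defined as

$$R_j^*(r) = E\{ \| B_r(\Sigma_j^*) - \hat{\Sigma}_j \|_1 \mid y_1, \ldots, y_n \}.$$

We choose $r_n$ to minimize $R_j^*(r)$. In practice we use the approximation

$$R_j^*(r) \approx \frac{1}{q} \sum_{k=1}^{q} \| B_r(\Sigma_{j,k}^*) - \hat{\Sigma}_j \|_1, \qquad (14)$$

where $\Sigma_{j,1}^*, \ldots, \Sigma_{j,q}^*$ are $q$ bootstrap estimates for $\Sigma_j$, obtained by repeating the above wild bootstrap
scheme $q$ times, and $q$ is a large integer.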
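A minimal sketch of the wild bootstrap choice of $r$ follows. It assumes that the bootstrap loss is the average, over $q$ bootstrap draws, of the matrix $L_1$ distance between the banded bootstrap estimate and the sample autocovariance; that reading, the function names `band` and `bootstrap_choose_r`, the candidate grid and the toy data are all assumptions of this illustration.

```python
import numpy as np

def band(H, r):
    """Banding operator: keep entries with |i - j| <= r."""
    idx = np.arange(H.shape[0])
    return np.where(np.abs(idx[:, None] - idx[None, :]) <= r, H, 0.0)

def bootstrap_choose_r(y, j, r_grid, q, rng):
    """Pick r from r_grid minimising the averaged bootstrap loss (assumed form)."""
    n = y.shape[0]
    yc = y - y.mean(axis=0)
    sigma_hat = yc[: n - j].T @ yc[j:] / n                # sample autocovariance at lag j
    losses = np.zeros(len(r_grid))
    for _ in range(q):
        u = rng.exponential(1.0, size=n - j)              # E(u_t) = var(u_t) = 1
        sigma_star = (yc[: n - j] * u[:, None]).T @ yc[j:] / n
        for m, r in enumerate(r_grid):
            losses[m] += np.linalg.norm(band(sigma_star, r) - sigma_hat, 1)
    losses /= q
    return r_grid[int(np.argmin(losses))], losses

# Toy banded VAR(1) data (assumed values).
rng = np.random.default_rng(3)
p, n = 8, 200
A = 0.3 * np.eye(p) + 0.2 * (np.eye(p, k=1) + np.eye(p, k=-1))
y = np.zeros((n, p))
for t in range(1, n):
    y[t] = A @ y[t - 1] + rng.standard_normal(p)

r_grid = [0, 1, 2, 3, 4]
r_hat, losses = bootstrap_choose_r(y, j=0, r_grid=r_grid, q=20, rng=rng)
print(r_hat, np.round(losses, 3))
```

Here `np.linalg.norm(M, 1)` returns the maximum absolute column sum, which is the matrix $L_1$ norm used in the loss. The exponential weights have mean and variance one, matching the requirement on $u_t$.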

5 Numerical properties 

5.1 Simulations 


In this section, we evaluate the finite-sample properties of the proposed methods for the model 

y t = Ay t _ i + e t , 


where $\{\varepsilon_t\}$ are independent and $N(0, I_p)$. We consider two settings for the banded coefficient matrix
$A = (a_{ij})$ as follows:

(i) $\{a_{ij}; |i - j| \le k_0\}$ are generated independently from $U[-1, 1]$. Since the spectral norm of $A$
must be smaller than 1, we re-scale $A$ by $\eta A / \|A\|_2$, where $\eta$ is generated from $U[0.3, 1.0)$;

(ii) $\{a_{ij}; |i - j| < k_0\}$ are generated independently from the mixture distribution $\xi \cdot 0 + (1 - \xi) \cdot
N(0, 1)$ with $\mathrm{pr}(\xi = 1) = 0.4$. The elements $\{a_{ij}; |i - j| = k_0\}$ are drawn independently from
$-4$ and $4$ with probability 0.5 each. Then $A$ is rescaled as in (i) above.

In (ii), there are about $0.4(2k_0 - 1)p$ zero elements within the band, i.e., $A$ is sparser than in (i).
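Setting (i) can be reproduced along the following lines. This is a sketch under stated assumptions: the function name `make_A_setting_i`, the seed and the dimensions are hypothetical, and the rescaling uses NumPy's spectral norm `np.linalg.norm(A, 2)`.

```python
import numpy as np

def make_A_setting_i(p, k0, rng):
    """Banded A with U[-1, 1] entries inside the band, rescaled to spectral norm eta."""
    idx = np.arange(p)
    mask = np.abs(idx[:, None] - idx[None, :]) <= k0
    A = np.where(mask, rng.uniform(-1.0, 1.0, size=(p, p)), 0.0)
    eta = rng.uniform(0.3, 1.0)                    # target spectral norm in [0.3, 1.0)
    return eta * A / np.linalg.norm(A, 2)

rng = np.random.default_rng(4)
A = make_A_setting_i(p=100, k0=2, rng=rng)
print(round(float(np.linalg.norm(A, 2)), 3), int(np.count_nonzero(A)))
```

Because the rescaled matrix has spectral norm $\eta < 1$, the resulting order-1 process is stationary; the price is that individual coefficients shrink as $p$ grows, which matters for identifying the band boundary.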

We set $n = 200$, $p = 100, 200, 400, 800$, and $k_0 = 1, 2, 3, 4$. We repeat each setting 500 times.
We only report the results with $K = 15$ in (8), as the results with other values of $K \ge k_0$ are
similar. Table 1 lists the relative frequencies of the occurrence of the events $\{\hat{k} = k_0\}$, $\{\hat{k} > k_0\}$ and
$\{\hat{k} < k_0\}$ over the 500 replications. Overall $\hat{k}$ under-estimates $k_0$, especially when $k_0 = 3$ or 4. In
fact when $k_0 = 4$, $\hat{k}$ chose 3 most times. Note that the constraint $\|A\|_2 < 1$ makes most non-zero elements
small or very small when $p$ is large, and that only the coefficients at least as large as $\{\log(p \vee n)/n\}^{1/2}$
are identifiable; see Condition 2. Estimation performs better in setting (ii) than in setting (i), as
Condition 2 is more likely to hold at the boundaries of the band in setting (ii).

The Bayesian information criterion (7) is defined for each row separately. One natural alternative
would be



where $\tau(k) = (2p + 1)k - k^2 - k$ is the total number of parameters in the model. This leads to the
following estimator for the bandwidth parameter,

$$\hat{k} = \arg\min_{1 \le k \le K} \mathrm{BIC}(k). \qquad (15)$$
Although this joint approach can be shown to be consistent, its finite sample performance, reported
in Table 2, is worse than that of the marginal Bayesian information criterion (7), presented in Table 1.

We also calculate both $L_1$ and $L_2$ errors in estimating the banded coefficient matrix $A$. The means
and the standard deviations of the errors for setting (i) are reported in Table 3. Table 3 also reports
results from estimating $A$ using the true values for the bandwidth parameter $k_0$. The accuracy loss
in estimating $A$ caused by unknown $k_0$ is almost negligible. The results for setting (ii) are similar
and are therefore omitted.

To evaluate the estimation performance for the auto-covariance matrices $\Sigma_0$ and $\Sigma_1$, we set
$k_0 = 3$, and the spectral norm of $A$ at 0.8. Furthermore, we let $\varepsilon_t$ be independent and $N(0, \Sigma_\varepsilon)$
now, where $\Sigma_\varepsilon = BB^T$ and $B = (b_{ij})$, $b_{11} = 1$, $b_{ij} = 0.8\, I(|i - j| = 1) + 0.6\, I(i = j)$, $i > 1$ or
$j > 1$. Table 4 lists the average estimation errors and the standard deviations over 100 replications,
measured by matrix $L_1$-norm. We also report Monte Carlo results for a thresholded estimator and
the sample covariance estimator. For the banded estimator, we choose $r$ to minimize the bootstrap
loss defined in (14) with $q = 100$. For the thresholded estimator, the thresholding parameter is
selected in the same manner. Table 4 shows that the proposed banding method performs much better
than the thresholded estimator since it adapts directly to the underlying structure, while the sample
covariance performs much worse than both the banding and threshold methods.

5.2 Real data examples 

Consider first the weekly temperature data across 71 cities in China from 1 January 1990 to 17
December 2000, i.e., $p = 71$ and $n = 572$. Fig. 1 displays the weekly temperature of Ha'erbin,
Shanghai and Hangzhou, showing strong seasonal behavior with period 52 weeks. Therefore, we
set the seasonal period to be 52 and estimate the seasonal effects by taking averages of the same
weeks across different years. The deseasonalized series, i.e., the original series subtracting estimated
seasonal effects, are denoted by $\{y_t; t = 1, \ldots, 572\}$, and each $y_t$ has 71 components.

Naturally we would order the 71 cities according to their geographic locations. However the
choice is not unique. For example, we may order the cities from north to south, from west to east,
from northwest to southeast, or from southwest to northeast. By setting $d = 1$, each ordering leads
to a different banded autoregressive model with order 1. We compare those four models by one-step
ahead and two-step ahead post-sample prediction for the last 30 data points in the series. To
select an optimum model, we compute (10) for these four orderings. These numerical results and
the selected bandwidth parameters $\hat{k}$ are reported in Table 5. Three out of those four models select
$\hat{k} = 2$, while the model based on the ordering from west to east picks $\hat{k} = 4$. Overall the model
based on the ordering from southwest to northeast is preferred, which also has the minimum one-step
ahead post-sample predictive errors. The performances of the four models in terms of the prediction
are very close.

Also included in Table 5 are the post-sample predictive errors of the sparse autoregressive model
with order 1 obtained via lasso by minimizing

$$\sum_{t=2}^{n} \| y_t - A y_{t-1} \|_2^2 + \sum_{i,j=1}^{p} \lambda_i |a_{ij}|,$$

where $\lambda_1, \ldots, \lambda_p$ are tuning parameters estimated by five-fold cross-validation as in Bickel and
Levina (2008). The prediction accuracy of the sparse model via lasso is comparable to those of the
banded autoregressive models, though slightly worse, especially for the two-step ahead prediction.
However the lack of any structure in the estimated sparse coefficient matrix $\hat{A}$, displayed in Fig. 2(b),
makes such fits difficult to interpret. In contrast, the banded coefficient matrix, depicted in Fig. 2(a),
is attractive.

As a second example, we consider the daily sales of a clothing brand in 21 provinces in China
from 1 January 2008 to 9 December 2012, i.e., $n = 1812$, $p = 21$. Fig. 3 plots the relative geographical
positions of 21 provinces and province-level municipalities. We first subtract from each of the
21 series its mean. Similar to the example above, we order the 21 provinces according to the
four different geographic orientations, and fit a banded autoregressive model with order 1 for each
ordering. The selected bandwidth parameters, the values according to (10) and the post-sample prediction
errors for the last 30 data points in the series are reported in Table 6. We also rank the series
according to their geographic distances to Heilongjiang, the most northeastern province; see Fig. 3.
This results in a different ordering to that from north to south. Table 6 indicates that the minimum
bandwidth parameter $\hat{k}$ is 3, attained by the ordering based on the distances to Heilongjiang, followed
by $\hat{k} = 4$ attained by the north-to-south ordering. The post-sample prediction performances
of those two models are almost the same, and are better than those of the other three banded models
and the sparse autoregressive model.

The ordering based on the direction from northwest to southeast leads to $\hat{k} = 12$. Therefore
the corresponding banded model has 21 regressors for some components according to (3), i.e., no
banded structure is observed in this case. Fig. 3 indicates that the ordering from northwest to southeast
puts together some provinces which are distant from each other. Hence this is certainly
a wrong ordering as far as the banded autoregressive structure is concerned.

The estimated coefficient matrix $\hat{A}$ for the banded vector autoregressive model with order 1
based on the distances to Heilongjiang and the estimated $\hat{A}$ by lasso for the autoregressive model
with order 1 are plotted in Fig. 4. The banded model facilitates an easy interpretation, i.e., the sales
in the neighbour provinces are closely associated with each other. The lasso fitting cannot reveal
this phenomenon.
Table 1: Relative frequencies (%) for the occurrence of the events {k̂ = k_0}, {k̂ > k_0} and {k̂ < k_0}
in a simulation study with 500 replications, where k̂ is defined in (8).

                         Setting (i)                      Setting (ii)
   p      k_0    k̂ = k_0   k̂ > k_0   k̂ < k_0     k̂ = k_0   k̂ > k_0   k̂ < k_0
  100      1        82        17         1           98         2         0
           2        87         8         5           95         3         2
           3        73         6        21           83         2        15
           4        55        14        31           64         2        34
  200      1        91         9         0           97         3         0
           2        89         4         7           93         2         5
           3        65         3        32           83         0        17
           4        54         1        45           63         2        35
  400      1        95         5         0           99         1         0
           2        87         2        11           90         1         9
           3        66         2        32           76         1        23
           4        45         1        54           60         0        40
  800      1        97         3         0          100         0         0
           2        86         1        13           91         1         8
           3        59         1        40           67         1        32
           4        40         0        60           52         0        48


Supplementary Material 

Supplementary material available at Biometrika online includes proofs of Theorems 1-4, the consistency
of the generalized Bayesian information criterion defined by (9) in Section 2.3 and the consistency
of the marginal Bayesian information criterion in the setting $k_0 \to \infty$, as well as the detailed proofs
of all the lemmas in this paper.
Table 2: Relative frequencies (%) for the occurrence of the events {k̂ = k_0}, {k̂ > k_0} and {k̂ < k_0}
in a simulation study with 500 replications, where k̂ is defined in (15).

                         Setting (i)                      Setting (ii)
   p      k_0    k̂ = k_0   k̂ > k_0   k̂ < k_0     k̂ = k_0   k̂ > k_0   k̂ < k_0
  100      1        64         0        36           88         0        12
           2        42         0        58           63         0        37
  200      1        56         0        44           84         0        16
           2        32         0        68           55         0        45
  400      1        48         0        52           83         0        17
           2        23         0        77           45         0        55
  800      1        44         0        56           76         0        24
           2        11         0        89           41         0        59


Table 3: Means (×10^2) with their corresponding standard deviations (×10^2) in parentheses of the
errors in estimating A under setting (i) in a simulation study with n = 200 and 500 replications.

                     With estimated k_0               With true k_0
   p      k_0    ||Â - A||_1   ||Â - A||_2      ||Â - A||_1   ||Â - A||_2
  100      1       38 (6)        27 (3)           37 (5)        27 (3)
           2       54 (6)        33 (3)           53 (5)        33 (3)
           3       70 (8)        39 (4)           69 (7)        38 (3)
           4       85 (10)       43 (5)           85 (8)        43 (3)
  200      1       40 (6)        28 (3)           40 (5)        28 (3)
           2       58 (7)        35 (3)           58 (6)        35 (3)
           3       74 (8)        40 (4)           74 (6)        40 (3)
           4       90 (11)       46 (5)           88 (7)        45 (3)
  400      1       43 (5)        30 (3)           42 (4)        30 (3)
           2       60 (6)        36 (3)           60 (5)        36 (3)
           3       77 (8)        42 (4)           76 (6)        42 (3)
           4       95 (14)       48 (7)           93 (7)        46 (3)
  800      1       44 (4)        31 (2)           44 (4)        31 (2)
           2       63 (5)        37 (3)           62 (5)        37 (2)
           3       81 (9)        43 (5)           80 (6)        43 (2)
           4       98 (14)       49 (7)           96 (7)        47 (2)
Table 4: Means with their corresponding standard deviations in parentheses of the errors in estimating
autocovariance matrices in a simulation study with n = 200 and 100 replications.

                    Error in estimating Σ_0                    Error in estimating Σ_1
   p        Banding     Thresholding    Sample         Banding     Thresholding    Sample

  Matrix L_1-norm
  100     2.1 (0.04)    2.6 (0.02)    14 (0.07)      2.9 (0.03)    3.5 (0.04)    14 (0.07)
  200     2.7 (0.04)    3.4 (0.03)    29 (0.02)      3.1 (0.03)    4.2 (0.04)    30 (0.02)
  400     2.3 (0.02)    2.9 (0.02)    55 (0.02)      2.8 (0.03)    3.7 (0.02)    55 (0.02)
  800     2.7 (0.03)    3.4 (0.02)   112 (0.03)      2.9 (0.03)    3.9 (0.03)   110 (0.04)

  Spectral norm
  100     1.1 (0.01)    1.4 (0.02)   4.0 (0.07)      1.4 (0.01)    1.7 (0.02)   3.7 (0.02)
  200     1.3 (0.03)    1.7 (0.02)   6.5 (0.03)      1.5 (0.01)    1.9 (0.01)   6.1 (0.02)
  400     1.2 (0.01)    1.6 (0.01)    10 (0.03)      1.3 (0.01)    1.9 (0.01)   9.2 (0.02)
  800     1.4 (0.02)    1.8 (0.01)    17 (0.03)      1.4 (0.01)    2.3 (0.02)    15 (0.03)



Table 5: Results of Example 1: Estimated bandwidth parameters, Bayesian information criterion
values and average one-step-ahead and two-step-ahead post-sample predictive errors over 71 cities
with their corresponding standard errors in parentheses.

  Ordering                   k̂     BIC      One-step ahead    Two-step ahead
  north to south             2     552.5    1.543 (1.170)     1.622 (1.245)
  west to east               4     555.9    1.545 (1.152)     1.602 (1.247)
  northwest to southeast     2     552.4    1.552 (1.167)     1.624 (1.249)
  southwest to northeast     2     551.9    1.538 (1.160)     1.617 (1.253)
  Lasso                      -     -        1.545 (1.172)     1.632 (1.250)


Table 6: Results of Example 2: Estimated bandwidth parameters, Bayesian information criterion
values and average one-step-ahead and two-step-ahead post-sample predictive errors over 21
provinces with their corresponding standard errors in parentheses.

  Ordering                   k̂     BIC      One-step ahead    Two-step ahead
  north to south             4     114.9    0.314 (0.377)     0.407 (0.386)
  west to east               7     115.2    0.323 (0.363)     0.409 (0.386)
  northwest to southeast     12    115.2    0.322 (0.361)     0.409 (0.395)
  southwest to northeast     5     115.1    0.316 (0.374)     0.407 (0.385)
  distance to Heilongjiang   3     114.7    0.313 (0.378)     0.407 (0.386)
  Lasso                      -     -        0.322 (0.362)     0.410 (0.393)

Figure 1: Deseasonalized weekly temperature in degrees Celsius (°C) from January 1990 to
December 2000, where Ha’erbin, Shanghai and Nanjing correspond to the plots from top to bottom.





Figure 2: Example 1: (a) Estimated banded coefficient matrix A for the model based on the ordering
from southwest to northeast, and (b) estimated sparse coefficient matrix A by lasso. White points
represent zero entries and gray or black points represent nonzero entries. The larger the absolute
value of a coefficient, the darker the colour.
Figure 3: Location plot of 21 provinces and province-level municipalities in China, where
Shanghai is a province-level municipality, and Ha’erbin, Hangzhou and Nanjing are the capitals of
Heilongjiang, Zhejiang and Jiangsu provinces, respectively.



Figure 4: Example 2: (a) Estimated banded coefficient matrix A for the model based on the ordering
using distances to Heilongjiang, and (b) estimated sparse coefficient matrix A by lasso. White points
represent zero entries and gray or black points represent nonzero entries. The larger the absolute
value of a coefficient, the darker the colour.
Supplementary Material: High Dimensional and Banded Vector Autoregressions


Abstract 

This supplementary material is organized as follows. We provide the detailed proofs of
Theorems 1-4 in Sections A.1-A.4, respectively. Section A.5 presents Proposition 1 and its
proof, showing the consistency of the generalized Bayesian information criterion stated in Remark
1 in the paper. In Section A.6, we present the consistency of the marginal Bayesian information
criterion selector $\hat k$ in a more general setting where $k_0 \to \infty$. Some technical lemmas and their
proofs are collected in Section A.7. Section A.8 presents some additional simulation results.


A.1 Proof of Theorem 1


Without loss of generality, we consider the VAR(1) model with $\|A_1\|_2 \le \delta < 1$. Our goal is to prove
that $\mathrm{pr}(\hat k = k_0) \to 1$, i.e., $\mathrm{pr}(\hat k \ne k_0) \to 0$. If $\hat k \ne k_0$, then either $\hat k > k_0$ or $\hat k < k_0$ holds. Hence
it suffices to show that $\mathrm{pr}(\hat k < k_0) \to 0$ and $\mathrm{pr}(\hat k > k_0) \to 0$. Our proof follows the arguments in
Wang et al. (2009).


Consider the first case. Observe that $\mathrm{pr}(\hat k < k_0) \le \mathrm{pr}(\hat k_i < k_0)$ for some $i \in \{1,\dots,p\}$ and the
event $(\hat k_i < k_0)$ implies $\{\min_{k<k_0} \mathrm{BIC}_i(k) \le \mathrm{BIC}_i(k_0)\}$. To prove $\mathrm{pr}(\hat k < k_0) \to 0$, we only need to
show that

$$\mathrm{pr}\{\min_{k<k_0} \mathrm{BIC}_i(k) \le \mathrm{BIC}_i(k_0)\} \to 0$$

for some $i$. Suppose that we have shown that there exists a constant $\eta > 0$ and an event $\mathcal{A}_n$ such that
$\mathrm{pr}(\mathcal{A}_n) \to 1$ as $n \to \infty$ and on the event $\mathcal{A}_n$,

$$\mathrm{RSS}_i(k) - \mathrm{RSS}_i(k_0) \ge \eta\,\mathrm{RSS}_i(k_0)\,(a_{i,i-k_0}^2 + a_{i,i+k_0}^2), \qquad (A.1)$$

for sufficiently large $n$, where $a_{j,k}$ is the $(j,k)$-element of $A_1$. On the event $\mathcal{A}_n$ with large $n$,
$\log \mathrm{RSS}_i(k) - \log \mathrm{RSS}_i(k_0) \ge \log\{1 + \eta(a_{i,i-k_0}^2 + a_{i,i+k_0}^2)\}$. Note that $\log(1+x) \ge \min(0.5x, \log 2)$
for any $x > 0$. Consequently, with probability tending to one, $\log \mathrm{RSS}_i(k) - \log \mathrm{RSS}_i(k_0)$ can be
further bounded below by $\min\{0.5\eta(a_{i,i-k_0}^2 + a_{i,i+k_0}^2), \log 2\}$. Condition 3 implies that for some
$i^* \in \{1,\dots,p\}$, $a_{i^*,i^*-k_0}^2 + a_{i^*,i^*+k_0}^2 \gg C_n \log p / n$ as $n \to \infty$. Hence, it follows that, with
probability tending to 1,

$$\min_{k<k_0} \mathrm{BIC}_{i^*}(k) - \mathrm{BIC}_{i^*}(k_0) \ge \min\{0.5\eta(a_{i^*,i^*-k_0}^2 + a_{i^*,i^*+k_0}^2), \log 2\} - C_n k_0 n^{-1} \log(p \vee n) > 0,$$

where $p \vee n = \max(p,n)$. Hence, $\mathrm{pr}\{\min_{k<k_0} \mathrm{BIC}_{i^*}(k) \le \mathrm{BIC}_{i^*}(k_0)\} \to 0$ and thus $\mathrm{pr}(\hat k < k_0) \to 0$.
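For completeness, the elementary inequality $\log(1+x) \ge \min(0.5x, \log 2)$ for $x > 0$, which is invoked in the underfitting argument, can be checked in two cases. The short derivation below is added for the reader and is not taken from the original proof.

```latex
% Case 1: 0 < x \le 1. Since \log(1+x) is concave with value 0 at x = 0 and \log 2 at x = 1,
% it lies above the chord joining these two points, and \log 2 > 0.5:
\log(1+x) \;\ge\; x\log 2 \;\ge\; 0.5\,x \;\ge\; \min(0.5x,\ \log 2).
% Case 2: x > 1. Monotonicity of the logarithm gives
\log(1+x) \;>\; \log 2 \;\ge\; \min(0.5x,\ \log 2).
```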

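Let us prove (A.1). For $k < k_0$, denote $H_{i,k} = X_{i,k}(X_{i,k}^{T} X_{i,k})^{-1} X_{i,k}^{T}$, $X_{i,k_0} = (S_{i,k}^{(1)}, X_{i,k}, S_{i,k}^{(2)})$
and $\beta_{i,k_0} = (b_{i,k}^{(1)T}, \beta_{i,k}^{T}, b_{i,k}^{(2)T})^{T}$, where $X_{i,k}$ is defined similarly to (4) in Section 2.2 except that $k_0$ is
replaced by $k$. Write $S_{i,k} = (S_{i,k}^{(1)}, S_{i,k}^{(2)})$ and $b_{i,k} = (b_{i,k}^{(1)T}, b_{i,k}^{(2)T})^{T}$. Then
$\mathrm{RSS}_i(k) = y_{(i)}^{T}(I_{n-1} - H_{i,k}) y_{(i)}$ and, by Lemma 5 (ii) or Lemma 6 (ii), we have

$$\mathrm{RSS}_i(k) - \mathrm{RSS}_i(k_0) = b_{i,k}^{T} S_{i,k}^{T} (I_{n-1} - H_{i,k}) S_{i,k} b_{i,k} + O_p(1).$$

From Lemma 5 (i) or Lemma 6 (i) and Lemma 7, there exists a small constant $\eta > 0$ such that, with
probability tending to one,

$$\lambda_{\min}\{S_{i,k}^{T} (I_{n-1} - H_{i,k}) S_{i,k}\} \ge \eta(1+\eta) n \sigma_i^2$$

and $\mathrm{RSS}_i(k_0) \le n\sigma_i^2(1+\eta)$. Since $b_{i,k}^{T} b_{i,k} \ge a_{i,i-k_0}^2 + a_{i,i+k_0}^2$, (A.1) follows.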

Now we turn to the overfitting case, i.e., we show that $\mathrm{pr}(\hat k > k_0) \to 0$. For $k > k_0$, set





(o T ,PI, ko ,oT,s hk = 



and S i)k = (J n _i — H i)ko )S i)k . Let r] be an arbitrary but fixed positive constant and define 

RSS i(k) 


B n = < inf inf 


k 0 <k<Kl<i<p na~ 


>( 1 - 


°n= IJ \xj n (n 1 Sf k S i , k ) < ^(l + rj), sup (n 1 £% k S itk ) < k 2 (1 + 


l<i<p 

ko<k<K 


l<j<k-k 0 


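We first give an upper bound on $\mathrm{RSS}_i(k_0) - \mathrm{RSS}_i(k)$ for $k > k_0$. For each $i$, $\mathrm{RSS}_i(k)$ can be
rewritten as

$$\mathrm{RSS}_i(k) = \inf_{b} \|y_{(i)} - X_{i,k} b\|^2 = \inf_{b_1, b_2} \|y_{(i)} - X_{i,k_0} b_1 - \tilde S_{i,k} b_2\|^2.$$

It can be verified that $\mathrm{RSS}_i(k_0) = \|(I_{n-1} - H_{i,k_0}) y_{(i)}\|^2$ and

$$\mathrm{RSS}_i(k) = \mathrm{RSS}_i(k_0) - \|\tilde S_{i,k} \tilde b_2\|^2,$$

where $\tilde b_2 = (\tilde S_{i,k}^{T} \tilde S_{i,k})^{-1} \tilde S_{i,k}^{T} e_{(i)}$. Then on the event $\mathcal{C}_n$ we have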


RSSj(/c 0 ) 


RSS i(k) 


= ^SiAsiAk^si^ 

< ^(1 + 7)1 n(k) - Ti(ko )| sup |n“ 1/2 ey ) (/ n _ 1 

j,k<p 


Hi ,k 0 ) x (k) | 


Define 

V n = { sup \n~ 1,2 e T (j) (I n - 1 - Hi, ko )x (k) \ 2 a^ 2 < ^ C n log(p V w)|. 

j,k<p 1 ~r 7] ) 

On the set B n fl C n H V n , for all k with k 0 < k < K, 

RSSi(fco) - RSSi(fc) < of (1 — v)\ T i(k) ~ Ti(k 0 )\C n log(p V n) 

< RSSi(k)C n \Ti(k) -7-j(/c 0 )|'« _1 log(pV n). 
Note that $\log(1+x) \le x$ for any $x > 0$. Hence, for all $k$ with $k_0 < k \le K$, on the set $\mathcal{B}_n \cap \mathcal{C}_n \cap \mathcal{D}_n$,

$$\mathrm{BIC}_i(k) - \mathrm{BIC}_i(k_0) = \log \mathrm{RSS}_i(k) - \log \mathrm{RSS}_i(k_0) + C_n |\tau_i(k) - \tau_i(k_0)| n^{-1} \log(p \vee n)$$
$$\ge -\{\mathrm{RSS}_i(k_0) - \mathrm{RSS}_i(k)\}\{\mathrm{RSS}_i(k)\}^{-1} + C_n |\tau_i(k) - \tau_i(k_0)| n^{-1} \log(p \vee n) > 0,$$

which indicates that over the set $\mathcal{B}_n \cap \mathcal{C}_n \cap \mathcal{D}_n$, we have $\hat k \le k_0$. To prove that
$\mathrm{pr}(\hat k > k_0) \to 0$, it suffices to show that $\mathrm{pr}\{(\mathcal{B}_n \cap \mathcal{C}_n \cap \mathcal{D}_n)^c\} \to 0$. In fact, it follows from Lemma 7
and Lemma 5 or 6 (i) that $\mathrm{pr}(\mathcal{B}_n^c) \to 0$ and $\mathrm{pr}(\mathcal{C}_n^c) \to 0$. It remains to show that $\mathrm{pr}(\mathcal{D}_n^c) \to 0$.
Let = n-'B (Xl„X kk ), E = n l XJ k X l k , where E(X) denotes the expectation of X. Set 
H, l)k = n^X^llXlb, and x (k) = (/ n _ 1 - H itk )x( k ). On the event C n , we obtain that 

sup - H i ko )xf k ) I 

j,k<P 

< sup |e^-)X( fc )| + sup |- H iiko )x( k) \ 

j,k<p j,k<p 

< sup \e T {j) x {k )\ + n~ l sup II||2 1 |II2II^fc 0 IIa||S ijfco - E i)fco || 2 ||X^ o x (fc )|| 2 

j,k<p j-,k<p 

< sup | e Jj)X(k) | + k 0 K^ 2 k 2 (l + v ) 2 sup | e(j)X(k) \ ■ \\%i,ko ~ ^i,k 0 h, 

j,k<p j,k<p 

where $\sup_{1 \le k \le p} (n^{-1} x_{(k)}^{T} x_{(k)}) \le \kappa_2 (1+\eta)$ is used in the above inequality. Hence, it follows from
Lemmas 5 and 6, together with Condition 3, that $\mathrm{pr}(\mathcal{D}_n^c) \to 0$ as $n \to \infty$. This completes the proof.


A.2 Proof of Theorem 2 


Since the autoregressive model with order $d$ can be formulated as an autoregressive model with order
1, without loss of generality, we consider the case of order 1 only. With probability tending to one,
$\hat k = k_0$, and thus it suffices to consider the set $\mathcal{A}_n = \{\hat k = k_0\}$. Over the set $\mathcal{A}_n$, for each $i$,

$$\hat\beta_i - \beta_i = (X_i^{T} X_i)^{-1} X_i^{T} e_{(i)}. \qquad (A.2)$$

For each $i$, the law of large numbers for stationary processes yields that $n^{-1} X_i^{T} X_i$ converges
to a positive definite matrix almost surely, and furthermore, with probability tending to one, $\lambda_{\min}(n^{-1} X_i^{T} X_i)$
is bounded away from zero. As a matter of fact, if we define

$$\mathcal{B}_n = \bigcap_{1 \le i \le p} \{\lambda_{\min}(n^{-1} X_i^{T} X_i) \ge \kappa_1 (1-\eta)\}$$

with a small constant $\eta \in (0,1)$, then it follows from Lemma 5 or Lemma 6, under different
moment conditions, that $\mathrm{pr}(\mathcal{B}_n) \to 1$ as $n \to \infty$. Hence, over the event $\mathcal{A}_n \cap \mathcal{B}_n$,


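$$\|\hat\beta_i - \beta_i\|_2^2 \le \kappa_1^{-2} (1-\eta)^{-2} n^{-2} \|X_i^{T} e_{(i)}\|_2^2 = C_1 n^{-2} \|X_i^{T} e_{(i)}\|_2^2,$$

where $C_1 = \kappa_1^{-2}(1-\eta)^{-2} > 0$. It is not hard to see from Lemma 5(ii) or Lemma 6(ii) that, for
all $1 \le i \le p$, $n^{-1} E\|X_i^{T} e_{(i)}\|_2^2 \le C_2$ with some constant $C_2 > 0$. Therefore, for a large positive
constant $C$, we obtain that

$$\mathrm{pr}(\|\hat A_1 - A_1\|_F^2 > C n^{-1} p) \le \mathrm{pr}(\|\hat A_1 - A_1\|_F^2 > C n^{-1} p,\ \mathcal{A}_n \cap \mathcal{B}_n) + \mathrm{pr}\{(\mathcal{A}_n \cap \mathcal{B}_n)^c\}$$
$$\le (Cp)^{-1} n\, C_1 n^{-2} \sum_{i=1}^{p} E\|X_i^{T} e_{(i)}\|_2^2 + \mathrm{pr}\{(\mathcal{A}_n \cap \mathcal{B}_n)^c\}$$
$$\le C_1 C_2 C^{-1} + o(1).$$

We establish the convergence rate of $\|\hat A_1 - A_1\|_F$ by taking a sufficiently large $C$.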

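Now we derive the convergence rate of $\|\hat A_1 - A_1\|_2$. For any matrix $B$, $\|B\|_2^2 \le \|B\|_1 \|B\|_\infty$.
Hence, on the event $\mathcal{A}_n$,

$$\|\hat A_1 - A_1\|_2 \le (\|\hat A_1 - A_1\|_1 \|\hat A_1 - A_1\|_\infty)^{1/2} \le (2k_0 + 1) \sup_{i \le p,\, j \le \tau_i} |\hat\beta_{ij} - \beta_{ij}|,$$

where $\hat\beta_{ij}$ and $\beta_{ij}$ are the $j$-th elements of $\hat\beta_i$ and $\beta_i$, respectively. Observe from (A.2) that, on the event $\mathcal{A}_n \cap \mathcal{B}_n$,

$$\sup_{i \le p,\, j \le \tau_i} |\hat\beta_{ij} - \beta_{ij}| \le \kappa_1^{-1} (1-\eta)^{-1} (2k_0 + 1) \sup_{j,k \le p} |n^{-1} e_{(j)}^{T} x_{(k)}|.$$

Hence, using Lemma 5(ii) or Lemma 6(ii), we have

$$\sup_{i \le p,\, j \le \tau_i} |\hat\beta_{ij} - \beta_{ij}| = O_p\{(n^{-1} \log p)^{1/2}\},$$

which shows that

$$\|\hat A_1 - A_1\|_2 = O_p\{(n^{-1} \log p)^{1/2}\}.$$

The proof is completed.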
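The norm inequality used in the last step can be checked numerically. The sketch below is an illustration only, not part of the proof: the matrix size, half-bandwidth, entries and function names are all assumptions chosen for the example. It builds a banded matrix, approximates its spectral norm from below by power iteration, and compares the result with the bounds $\|B\|_2^2 \le \|B\|_1 \|B\|_\infty$ and $\|B\|_2 \le (2k_0+1)\max_{i,j}|b_{ij}|$.

```python
import math
import random


def banded_matrix(p, k0, rng):
    """Random p x p matrix with entries in [-1, 1] inside the band |i - j| <= k0, zeros outside."""
    return [[rng.uniform(-1.0, 1.0) if abs(i - j) <= k0 else 0.0
             for j in range(p)] for i in range(p)]


def norm_1(B):
    """Matrix L1 norm: maximum absolute column sum."""
    return max(sum(abs(row[j]) for row in B) for j in range(len(B[0])))


def norm_inf(B):
    """Matrix L-infinity norm: maximum absolute row sum."""
    return max(sum(abs(v) for v in row) for row in B)


def spectral_norm(B, iters=500):
    """Power iteration on B^T B; returns ||B v|| for a unit vector v, which never exceeds ||B||_2."""
    p = len(B[0])
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):
        Bv = [sum(row[j] * v[j] for j in range(p)) for row in B]
        w = [sum(B[i][j] * Bv[i] for i in range(len(B))) for j in range(p)]
        scale = math.sqrt(sum(x * x for x in w))
        if scale == 0.0:
            return 0.0
        v = [x / scale for x in w]
    Bv = [sum(row[j] * v[j] for j in range(p)) for row in B]
    return math.sqrt(sum(x * x for x in Bv))


rng = random.Random(0)
p, k0 = 30, 2                      # illustrative sizes, not taken from the paper
B = banded_matrix(p, k0, rng)
s = spectral_norm(B)
max_entry = max(abs(v) for row in B for v in row)

print(s ** 2 <= norm_1(B) * norm_inf(B))    # ||B||_2^2 <= ||B||_1 ||B||_inf
print(s <= (2 * k0 + 1) * max_entry)        # banded case: at most 2 k0 + 1 nonzeros per row and column
```

Each row and column of a banded matrix has at most $2k_0+1$ nonzero entries, which is why both the $L_1$ and $L_\infty$ norms are bounded by $(2k_0+1)$ times the largest absolute entry.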


A.3 Proof of Theorem 3 

The covariance matrix $\Sigma_0$ can be expressed as

$$\Sigma_0 = \Sigma_\varepsilon + \sum_{j=1}^{\infty} B_j, \qquad B_j = J \tilde A^{j} J^{T} \Sigma_\varepsilon J (\tilde A^{T})^{j} J^{T}, \quad j \ge 1,$$

where $J = (I_{p \times p}, 0_{p \times (d-1)p})$ and $\tilde A$ is the companion matrix. Let $\Gamma_j = J \tilde A^{j} J^{T}$, $j \ge 1$. By the
companion matrix $\tilde A$, we can show that $\Gamma_0 = I_p$ and $\Gamma_j = \sum_{k=1}^{\min(j,d)} \Gamma_{j-k} A_k$, $j \ge 1$. It is easy
to see that for two banded matrices $F$ and $G$ with bandwidths $2r_1 + 1$ and $2r_2 + 1$, respectively, the
product matrix $FG$ is also banded and its bandwidth is at most $2(r_1 + r_2) + 1$. Therefore, it can be
verified that $\Gamma_j$ is banded with bandwidth at most $2jk_0 + 1$, and then $B_j$ is also banded with
bandwidth at most $2(2jk_0 + s_0) + 1$ for $j \ge 1$.

Take $\Sigma_0^{(r)} = \Sigma_\varepsilon + \sum_{j=1}^{r} B_j$, which is banded with bandwidth at most $2(2rk_0 + s_0) + 1$, and
$\Sigma_0 - \Sigma_0^{(r)} = \sum_{j=r+1}^{\infty} B_j$. Note that for any $j \ge 1$, $\|B_j\|_2 \le \|\Sigma_\varepsilon\|_2 \|\tilde A^{j}\|_2^2 \le C \|\Sigma_\varepsilon\|_2 \delta^{2j}$ for some $C > 0$.
Write $C_1 = C \|\Sigma_\varepsilon\|_2 (1 - \delta^2)^{-1}$. It follows that

$$\|\Sigma_0 - \Sigma_0^{(r)}\|_2 \le \sum_{j=r+1}^{\infty} \|B_j\|_2 \le C \|\Sigma_\varepsilon\|_2 (1 - \delta^2)^{-1} \delta^{2(r+1)} = C_1 \delta^{2(r+1)}.$$

By using the inequality $\|B_j\|_1 \le \{2(2jk_0 + s_0) + 1\} \|B_j\|_2 \le C(2j + 1)\delta^{2j}$ for some $C > 0$, we
obtain

$$\|\Sigma_0 - \Sigma_0^{(r)}\|_1 \le C_2\, r\, \delta^{2(r+1)}.$$

Other inequalities can be proved analogously. The proof is complete.
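The bandwidth bookkeeping in this proof rests on the fact that multiplying banded matrices adds their half-bandwidths. A minimal sketch of that fact follows; the matrices are filled with ones inside the band (an assumption made so that no cancellation can hide a nonzero entry), and the helper names are hypothetical.

```python
def banded_ones(p, r):
    """p x p matrix with ones where |i - j| <= r and zeros elsewhere (bandwidth 2r + 1)."""
    return [[1.0 if abs(i - j) <= r else 0.0 for j in range(p)] for i in range(p)]


def matmul(F, G):
    """Plain triple-loop matrix product of two list-of-lists matrices."""
    n, m, q = len(F), len(G), len(G[0])
    return [[sum(F[i][k] * G[k][j] for k in range(m)) for j in range(q)] for i in range(n)]


def half_bandwidth(M):
    """Largest |i - j| over the nonzero entries, so the bandwidth is 2 * half_bandwidth + 1."""
    return max((abs(i - j) for i, row in enumerate(M)
                for j, v in enumerate(row) if v != 0.0), default=0)


p, r1, r2 = 15, 2, 3
FG = matmul(banded_ones(p, r1), banded_ones(p, r2))
print(half_bandwidth(FG), "<=", r1 + r2)

# Powers of a banded matrix with half-bandwidth k0: the j-th power has half-bandwidth at most j * k0.
k0 = 1
A = banded_ones(p, k0)
power = A
power_half_bandwidths = [half_bandwidth(power)]
for _ in range(3):
    power = matmul(power, A)
    power_half_bandwidths.append(half_bandwidth(power))
print(power_half_bandwidths)
```

With nonnegative entries the bound is attained, which shows that the rate at which the band of $\Gamma_j$ widens with $j$ cannot be improved in general.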


A.4 Proof of Theorem 4 



Now we prove the convergence rate of $\|\hat\Sigma_0^{(r_n)} - \Sigma_0\|_2$. First, $\|\hat\Sigma_0^{(r_n)} - \Sigma_0\|_2$ can be bounded above
by

$$\|\hat\Sigma_0^{(r_n)} - \Sigma_0^{(r_n)}\|_2 + \|\Sigma_0^{(r_n)} - \Sigma_0\|_2 =: R_{n1} + R_{n2}.$$

Similar to Theorem 2, $R_{n1} \le (4 r_n k_0 + 2 s_0 + 1) \sup_{j,k \le p} |\hat\Sigma_{jk} - \Sigma_{jk}|$. From Lemma 5(i) or Lemma
6(i), we obtain that

$$R_{n1} = O_p\{r_n (n^{-1} \log p)^{1/2}\}.$$

From Theorem 3, $R_{n2} \le O(\delta^{2(r_n+1)})$. Note that $r_n = C \log\{n \log^{-1}(p)\}$ with $C > (-4 \log \delta)^{-1}$.
Combining these results, it follows that

$$\|\hat\Sigma_0^{(r_n)} - \Sigma_0\|_2 = O_p\{r_n (n^{-1} \log p)^{1/2} + \delta^{2(r_n+1)}\} = O_p[\log\{n \log^{-1}(p)\} (n^{-1} \log p)^{1/2}].$$

The proofs of other results are similar and omitted.
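The choice $C > (-4\log\delta)^{-1}$ is what makes the truncation bias $\delta^{2(r_n+1)}$ no larger than the sampling rate $(n^{-1}\log p)^{1/2}$. The sketch below checks this arithmetic for a few settings; the values of $n$, $p$ and $\delta$, the `slack` factor and the function names are assumptions for illustration, and $r_n$ is left as a real number rather than rounded to an integer.

```python
import math


def truncation_level(n, p, delta, slack=1.05):
    """r_n = C log(n / log p) with C = slack / (-4 log delta); slack > 1 gives C > (-4 log delta)^{-1}."""
    C = slack / (-4.0 * math.log(delta))
    return C * math.log(n / math.log(p))


def truncation_bias(r, delta):
    """Order of the bias term delta^{2(r + 1)} from Theorem 3."""
    return delta ** (2.0 * (r + 1.0))


def sampling_rate(n, p):
    """The entrywise rate (log p / n)^{1/2} from Lemma 5(i) or 6(i)."""
    return math.sqrt(math.log(p) / n)


settings = [(200, 100, 0.5), (200, 800, 0.8), (1000, 400, 0.9)]   # hypothetical (n, p, delta)
results = []
for n, p, delta in settings:
    r_n = truncation_level(n, p, delta)
    results.append((r_n, truncation_bias(r_n, delta), sampling_rate(n, p)))

for (n, p, delta), (r_n, bias, rate) in zip(settings, results):
    print(n, p, delta, round(r_n, 2), bias <= rate)
```

The reason is that $\delta^{2 r_n} = \{n/\log p\}^{2C\log\delta}$ and $2C\log\delta < -1/2$, so the bias is dominated by $(\log p/n)^{1/2}$ whenever $n > \log p$.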


A.5 Proposition 1 and its proof 

PROPOSITION 1. Under Conditions 1-4 in Section 3.1 of the original article, $\mathrm{pr}(\hat k = k_0, \hat d = d) \to 1$ as $n, p \to \infty$.

Proof of Proposition 1. Our primary goal is to prove that $\mathrm{pr}(\hat k = k_0, \hat d = d) \to 1$, i.e.,

$$\mathrm{pr}\{(\hat k \ne k_0) \cup (\hat d \ne d)\} \to 0.$$

Note that 

pr{(/c f kf)C (d f d)} < pr (k < k 0 ) + pr(d < d) + pr(A; > k 0 ,d > d). 





We observe that both events $\{\hat k < k_0\}$ and $\{\hat d < d\}$ correspond to the underfitting case, where some
important variables are missed in the estimated model. Hence, following the proof of Theorem 1,
we can show $\mathrm{pr}(\hat k < k_0) + \mathrm{pr}(\hat d < d) \to 0$.

It remains to prove that pr (k > k 0 , d > d) —» 0. First look at the event A = [k > k (h d > d!{. 
Define Ai = U {ki > k 0 ,di > d\, A 2 = U {ki < A 4 > d\, A 3 = U {ki > k 0 ,di > d\, 

i<p ^ J i<p 1 J i<p ^ J 

and A = U {k t > k 0l di < d\. Then A C A U A 2 U A 3 U A, which implies that it suffices to 

show pr(A) —> 0 for each k = 1 ,..., 4. Observe that both events A and A correspond to the 
overfitting case, where all important variables as well as some unimportant variables are selected by 
the estimated model. Hence, following the proofs of Theorem 1, we can show pr(A) +pr(A) —;> 0. 
Now we are going to prove pr(A) —* 0 as n,p —> 00 . For each i, {ki < k 0 , di > d} means 
min BIC;(fc, (!) < B IC,(A d). Hence, we only need to show, with probability tending to one, 

k<ko,£>d 


min min < BlCi(k,() — BICTA d) > > 0. (A.3) 

i<p k<k 0 ,d<e<L l J 

Suppose that we have shown that there exists a constant 77 > 0 and an event Q n such that pr((/ n ) —> 1 
as n —y 00 and on the event Q n , 

min {RSSi(/c, () — RSS,(A d) — r^RSS^/co, d)Ai > 0, (A.4) 

i<p 1 

for each k < k 0 , d < ( < L and sufficiently large n, where A* = Y^=i {( a u-fc 0 ) 2 + ( a u+/c 0 ) 2 } • As a 
result, on the event Q n with large n, minj< p { log RSS i(k, () — log RSS;( A d) — log(l + 77 A*)} > 0. 
Note that log(l + x) > min{0.5:r, log 2} for any x > 0. Then, with probability tending to one, 
logRSSj(fc)—log RSS * (/c 0 ) can be further bounded below by min(0.5 A, , log 2). Condition 3 implies 
that minj< p Aj C n n ~ 1 log(p V n) as n —> 00 . Hence, it follows that, with probability tending to 
1 , 

BICj(fc, £) — BICj(A d) > min( 0 . 577 Aj, log 2 ) — C n Ti(k 0 , d)n~ l log(p V n) >0 (A.5) 


uniformly for all k < A d < ( < L and i — 1 ,..., p. Hence, pr( A) —>■ 0 as n, p —> 00 . 

Let us turn to prove (IA.4I) . For k < ko and d < ( < L, denote Am = Am(^M X *>m) \x^ e , 
where Am is defined as in section 2.2 but replaced k 0 and d by k and (. Then RSS i(k,() = 
yji) (4-1-Am) A)- Infact ’ X M canbe rewritten as X M = (S^ v X$,S$ t2 ,S$ A , X$, l 
and, similarly, (3j ko/ = b \Let S itt = (S^, S$ j2 ,S$ A , S$ t2 ) 

and bj f = (bfl , Ig ),..., bf \, bfl). As a result, by Lemma 5 (ii) or Lemma 6 (ii), we have 


max |RSS,(M) - RSS t (k 0 ,d) - 4^(4-i - A,m)4Am| = o P { 1). 

i<p ’ ’ 


From Lemma 5 (i) or Lemma 6 (i) and Lemma 7, there exists a small constant 77 > 0 such that, with 
probability tending to one, 

Amin{££*(/ - Am) 1 ^} > VO- + r i) n(T h 

and RSS i(k 0 , d) < naf{ 1 + 77 ). Note that bjjb^e > Y? 3 =i {( a %i-k 0 ) 2 + i a %i+k 0 f}- Therefore, (IA.4b 
follows. 

In a similar manner, pr( A) —^ 0 can be proved. The proof is completed. 
A.6 Proposition 2 and its proof


PROPOSITION 2. Under Conditions 1' and 2-4 in Section 3.1 of the original article, $\mathrm{pr}(\hat k = k_0) \to 1$
as $n, p \to \infty$, provided $k_0 \ll C_n^{-1} n / \log(p \vee n)$.

Proof of Proposition 2. First, we can prove the conclusions of Lemmas 5, 6 and 7 under
Conditions 1' and 2-4. For instance, in the proof of Lemma 5, we bound $\|A_1^{l}\|_\infty$ in (A.10) by
$\|A_1^{l}\|_\infty \le C\delta^{l}$ under Condition 1'. The inequalities (A.11) and (A.12) in the proof of
Lemma 6 can be bounded in a similar way. Then, following the proof of Theorem 1, we can prove
the consistency of the Bayesian information criterion selector $\hat k$ in the general setting $k_0 \to \infty$.


A.7 Seven technical lemmas and their proofs 


We first adopt the asymptotic theory based on the functional dependence measure of Wu (2005).
Assume that $z_i$ is a stationary process of the form $z_i = g(\mathcal{F}_i)$, where $g(\cdot)$ is a measurable
function and $\mathcal{F}_i = (\dots, e_{-1}, e_0, \dots, e_i)$ with independent and identically distributed random variables
$\{e_i, i = 0, \pm 1, \dots\}$. Wu (2005) defined the functional dependence measure in terms of how the
outputs are affected by the inputs. To be specific, denote $\|z\|_q = \{E(|z|^q)\}^{1/q}$ with $q \ge 1$ for a random
variable $z$. The physical or functional dependence measure is defined as

$$\theta_{i,q} = \|z_i - z_i^*\|_q = \|g(\mathcal{F}_i) - g(\mathcal{F}_i^*)\|_q,$$

where $z_i^* = g(\mathcal{F}_i^*)$ is the coupled process of $z_i$, $\mathcal{F}_i^* = (\dots, e_{-1}, e_0', e_1, \dots, e_i)$ with $\{e_0', e_0\}$ being
independent and identically distributed. Intuitively, $\theta_{i,q}$ measures the dependency of $z_i$ on $e_0$ while
keeping all other innovations unchanged.
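To make the coupling concrete, consider a scalar first-order autoregression $z_i = a z_{i-1} + e_i$, which is an assumption made here purely for illustration. Replacing $e_0$ by an independent copy $e_0'$ changes $z_i$ by exactly $a^{i}(e_0 - e_0')$, so $\theta_{i,q} = |a|^{i}\|e_0 - e_0'\|_q$ decays geometrically. The sketch below verifies the path-wise identity on one simulated path; the coefficient, the Gaussian innovations, the finite burn-in that stands in for the infinite past, and the function name are all hypothetical choices.

```python
import random


def ar1_path(a, innovations):
    """Iterate z_i = a * z_{i-1} + e_i, starting from zero before the first innovation."""
    z, path = 0.0, []
    for e in innovations:
        z = a * z + e
        path.append(z)
    return path


rng = random.Random(1)
a, burn_in, horizon = 0.6, 50, 10

# Innovations e_{-burn_in}, ..., e_{-1}, e_0, e_1, ..., e_horizon; index burn_in holds e_0.
e = [rng.gauss(0.0, 1.0) for _ in range(burn_in + horizon + 1)]
e_coupled = list(e)
e_coupled[burn_in] = rng.gauss(0.0, 1.0)      # replace e_0 by an independent copy e_0'

z = ar1_path(a, e)
z_star = ar1_path(a, e_coupled)

gap = e[burn_in] - e_coupled[burn_in]
differences = [z[burn_in + i] - z_star[burn_in + i] for i in range(horizon + 1)]
expected = [a ** i * gap for i in range(horizon + 1)]

for i in range(0, horizon + 1, 5):
    print(i, differences[i], expected[i])
```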


Lemma 1. (Theorem 2 (ii) of Liu, Xiao and Wu (2013)). Let $S_n = n^{-1/2} \sum_{i=1}^{n} z_i$ and
$\Theta_{m,q} = \sum_{i=m}^{\infty} \theta_{i,q}$. Assume that for each $m$, $\Theta_{m,q} = O(m^{-\alpha})$ with $\alpha > 1/2 - 1/q$ and $q > 2$. Then there
exist positive constants $C_1$, $C_2$ and $C_3$ which only depend on $q$ such that for all $x > 0$,


pr(\S n \ >x)< 


giQg.,» 


+ C 3 exp (C 20 0 ^ q x 2 ). 


To prove the limit theory for the sub-exponential tail case under Condition 4(ii), we shall use 
Lemmas 2-4. 

Lemma 2. Suppose that $X$ is a random variable. Then, $E\{\exp(t_0 |X|^{v})\} < \infty$ for some $0 < v \le 2$
and $t_0 > 0$ if and only if

$$\limsup_{q \to \infty} q^{-1/v} \|X\|_q < \infty.$$

Proof. Assume that ( = i7{exp(fo/Y| u )} < oo. Then, for any q > 2, 

/♦OO 

E(\X\ q ) — q x 9_1 pr(|X| > x)dx 


< (qv \ q/v 


x q ! v x exp 


(~x)dx = C qv \ q/v T , 
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where T(-) is the Gamma function. By Stirling’s formula, 

lim T(x + 1)| (2nx) l t 2 j | =1, 

we obtain that for all sufficiently large q, 

where C is a constant depending on (, v and f 0 only. This implies that 

lim sup g _1 /' u ||X|| ? < oo. 

q —^oo 

Conversely, assume that lim g“ 1//i; ||A^|| g < oo. Then, there exists a positive constant 

0o > 0 such that, ||A|| ? < 0 O Q l ^ v for all Q > 2. Note that exp(x) = 1 + J2 k>1 (k\)~ 1 x k . To prove 
that f?{exp(f 0 |A"| ,; )} < oo for some f 0 > 0, we only need to show that there exist positive constants 
t 0 and k 0 such that 


E 

k>ko 


toiler 


vk 


k\ 


< oo. 


By Stirling’s formula, there exists a large integer k 0 such that for k > k 0 . 


T (k + 1) = k\ > (irk) 1 / 2 ^ 


With such k 0 and t 0 = ( 2<j) l Qve ) \ we have 




k>k 0 


k\ 


sr' M v 0 ve) k k k k 

- f7rfrU/2frfc - < ' 


k>k 0 


k) l / 2 k k 


k>ko 


□ 


Lemma 3. Suppose that $\{X_1, \dots, X_n\}$ are independent random variables and $\sup_{i \le n} E\{\exp(t_0 |X_i|^{\alpha})\} \le \zeta$
for some positive constants $\alpha$, $t_0$ and $\zeta$ with $0 < \alpha \le 1$. Then there exist positive constants $C_j > 0$
$(j = 1, \dots, 4)$ which depend only on $\alpha$, $t_0$ and $\zeta$ such that for any $x > 0$ and all $n$, the following
concentration inequality holds:


n 

HI Ew - 

i= 1 


> 3x 


< Ci exp 
+C\ exp 


x 


1 — ex. 

Cyn + C 3 n 2 ~ a x 


x 


2 a 


Cyn 2 ~ a + C 3 x° 


+ nC\ exp (—C 4 a; a )(-A.6) 


In particular, if $\alpha = 1$, then

$$\mathrm{pr}\Big[\Big|\sum_{i=1}^{n} \{X_i - E(X_i)\}\Big| \ge 3x\Big] \le C_1 \exp\Big(-\frac{x^2}{C_2 n + C_3 x}\Big) + C_1 n \exp(-C_4 x)$$

for any $x > 0$ and $n$.
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Proof. For the case of a = 1, (IA.6I) can be proved by Bernstein’s inequality directly. So here we 
consider the case of 0 < a < 1 only. Let £ n ^and £ n2 be two constants with 0 < < £ n2 , which 

depend on n and will be defined below. Let X,\ = XJilX,] < £ nl ), X i2 = Xj /(£ n1 < X, < £ n2 ) 
and X i3 = XiI(\Xi\ > £ n2 ). Then X t = X a - E(X a ) + X i2 - E(X i2 ) + X i3 - E(X i3 ), and hence 


pr[| - E(X,)} 

i= 1 


> 3x 


3 n 

£ EP r [|E{'^--B(A', fc )} 

fc=l *=1 


> x 


In the following, we will give an upper bound on each term separately. 

Now consider the first term. Let cr 2 be a finite constant such that sup i<n E\Xi\ 2 < cr 2 . Note that 
| jfji| < | and LA 2 < cr 2 for all i. By Bernstein’s inequality for bounded variables, we get that 

„2 


pr |^{X il -S(Y il )}|> 


X 


< 2 exp 


x 


1=1 


2na 2 + 2£ nl x/3 7 


(A.7) 


Let us handle the second term. To use Bernstein’s equality, we only require an appropriate 
control of moments. Using integration by parts, we observe that 

E( \x i2 \ q ) <q M 9_1 pr(|Xi| > u)du + CCP r (l^l > f n i) 
for q > 2. For integer q > 2, 

r£n 2 fin 2 

q / rt 9 ^ 1 pr(|A 7 'j| > u)du <q( u q ~ l exp(— t 0 u a )du 


£nl 


£nl 


r-io?“ 2 /2 


< qa- l C,(2t^) q/a / u qla ~ x exp(-2u)du 




r*o%/2 


< qa 1 C(2t 0 1 ^ 2 “) f/ exp(-2 ^oC) 

— 1 /■ / _»— 1 /-l— a \^ / q —r-CK \ (c\j_— 1 /-l— Q —2 


■u 9 1 exp(— u)du 

< g!4a _1 C(to :1 &T“)' exp(-2 _1 t 0 C) ( 2i o ^2 “)' 


Choose = {4f 0 X (1 — a)/{2 — a) logn} 1 /" and £„ 2 = n 1 ^ 2 V x. Write = rC 1 °^ 2 A anc [ 
^ = max(16C« _1 fo 2 , cr 2 ). Then 

rin2 


q / u 


Q— 


V(|*i| > u)du < L -q\u{l y x 2 ^C 2 }{2tf\i n V^r 2 


We also have that ^prdAd > f nl ) < £ 9 i exp(-f 0 Ci) = Co exp(-f 0 Ci)G 2 - A sim P le manipu¬ 
lation yields that there exists a positive integer N o to which depends only on a and t 0 such that 

Co < Cn 21 ^ni ex P(-^oC) < 4a _1 CV 2 ; 2t o 1 ^n 2 Q > Co, and 4 log n < t 0 ^, 
if n > A" Qj i 0 . Then, if x < n 1 /! 2- "), 

B(|V j 2|’)<C !l, ( 2i 0 _1 U^ 2 
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for q > 2; otherwise, 


E(\X i2 \ q ) < ^q\u{x 2 ^- a ^- 2 }(2t^x 1 - a ) q 2 


for q > 2. By Bernstein’s inequality, we obtain that 


pr[|£{X i2 -.E(X i2 )}| > 


X 


i= 1 


< 2 exp 
+2 exp 


x 


2nv + 4f n 1 n 2 ~ a x 


x 


2 a 


2 un 2 - a + 4 t n 1 x° 


For the last term, we note that 

n 

[l £{*«-£(*«)}! > 


X 


2=1 


C \ sup \Xi\ > £ n2 j U | sup |Xj| < £ n2 , 

n 

J2\ E ( X iI(\ X i\ > £ n2 ))| 


> I . 


i=l 


Therefore, we have 

n 

pr[| ^(A'o - S(X«)} 

2=1 

< prfsup \Xi\ > t, n2 ) + pr sup \Xi\ < £ n2 , ^ \E{XiI(\Xi\ > £ n2 )}| > 


> x 


X 


i= 1 


(A.8) 


Note that ( = sup i<n £’{exp(f 0 |A' i | a )} < oo. We observe that 

pr(sup |Xi| > £ n2 ) < Cnexp ( - t 0 C 2 ) < C^exp ( - t 0 x“) • 

In a similar fashion, we obtain that 

n 

\E{XiI(\Xi\ > £ n2 )}| < na{pr(\Xi\ > £ n2 )} 1/2 < n“VCexp (2 log n - 2 _1 t 0 £" 2 ). 


2=1 


As a result, for x > cr(n 1 and n > N ajto , 

n 

pr[|J]{A B -B(A' 3 )}|> 


X 


i =1 


< (n exp ( — t 0 x' 


(A.9) 


Combing the three inequalities (IA.7I) - (IA.9I) . we conclude that, for x > cr(n 1 and a > N nJf> , 


prfl EfA'i - E(X i )} 


2=1 


> 3x 


< 4 exp 
+2 exp 


x 


2 nv + 4 t 0 1 n 2 ~ a x 

2 a 


X 


2vn 2 ~°‘ + 4 t Q l x a 


+ nQ exp (—t 0 x a ) 
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If x < a(n 1 or n < N ajto , we can always multiply a large positive constant C on the right hand 
side to make the inequality hold. The proof is completed. □ 


Lemma 4. Suppose that $\{X_1 = (X_{1,1}, X_{1,2})^{T}, X_2 = (X_{2,1}, X_{2,2})^{T}, \dots\}$ are independent random
vectors and $\sup_{i \le n,\, j=1,2} E\{\exp(t_0 |X_{i,j}|^{2\alpha})\} \le \zeta$ for some positive constants $\alpha$, $t_0$ and $\zeta$ with
$0 < \alpha \le 1$. Denote by $l_n$ a sequence that may depend on $n$, with $1 \le l_n \le O(n^{\epsilon})$ and $0 < \epsilon < 1$. Then,
for each $m$ and $m'$ with $m, m' = 1, 2$, there exist positive constants $C_j$ $(j = 1, \dots, 4)$ such that for
any $x > 0$, the following concentration inequality holds:


pr 


n 

[| E« E(X itin X i+ i ntin i )}| > 3 (f n + l)x 

i =1 


< (In + 1 )Ci exp 


X 


1 — a 

C 2 n + C 3 n 2 ~ a x 


x 


2a 


+ Ci (l n + 1) exp ( — ——^-—— 

C 2 n 2 - a + C 3 x c 


+Ci(l n + l)nexp (~C A x a ) 


Proof. Without loss of generality, we assume that $n/(l_n + 1)$ is a positive integer. Here we prove
the inequality for $m = 1$ and $m' = 2$ only. Similar techniques can be applied to other cases. Let
$Y_{ji} = X_{(i-1)(l_n+1)+j,1} X_{i(l_n+1)+j-1,2}$. Then, for each $j$, $\{Y_{ji}, i = 1, \dots, n/(l_n + 1)\}$ are independent
with $\sup_{i,j} E\{\exp(t_0 |Y_{ji}|^{\alpha})\} \le \zeta < \infty$. With the help of $Y_{ji}$, $\sum_{i=1}^{n} \{X_{i,1} X_{i+l_n,2} - E(X_{i,1} X_{i+l_n,2})\}$
can be re-expressed as

$$\sum_{i=1}^{n} \{X_{i,1} X_{i+l_n,2} - E(X_{i,1} X_{i+l_n,2})\} = \sum_{j=1}^{l_n+1} \sum_{i=1}^{n/(l_n+1)} \{Y_{ji} - E(Y_{ji})\}.$$

By Lemma 3, we obtain that there exist positive constants Cj(j = 1,..., 4) such that 


pr 


n/{ln+ 1 ) 


E {Y, i -E(Y, i )} 


i= 1 


> 3a: 


< C\ exp 


+ Ci exp 


x 


1 — Oi 

C 2 n + C 3 n 2 ~<*x 


x 


2a 


C 2 n 2 ~ a + C 3 x° 
+ C\n exp (— CiX a ), 


for each j = 1,, l n + 1. Note that 


2=1 

Therefore, 


n n/(l n -\-l) 

y2{X i ,iX i+ln>2 -E(X it iX i+lnt2 )} \<(l n + l) sup I V {Y Tl -E(Y Ti )} 

j<ln +1 1 


2=1 


pr 


n 

[| E { X aX<+l „2 - E(X,_ iA' i+ln , 2 )}| > 3(i„ + l)x 
2=1 


n/(ln + 1 ) 

< (l n + 1) sup pr j: E {^-swo} 


j<ln ~\-1 


2=1 


> 3a; 


The lemma is proved. 


□ 
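The blocking device in the proof of Lemma 4 splits the lag-$l_n$ cross products into $l_n + 1$ classes, each of which involves non-overlapping observations and is therefore a sum of independent terms. The sketch below checks this index bookkeeping; the values of `n` and `lag` and the function name are illustrative assumptions.

```python
def blocks(n, lag):
    """Split t = 1, ..., n into lag + 1 classes; class j holds t = (i - 1)(lag + 1) + j for i = 1, 2, ..."""
    step = lag + 1
    if n % step != 0:
        raise ValueError("n must be a multiple of lag + 1")
    return {j: [(i - 1) * step + j for i in range(1, n // step + 1)]
            for j in range(1, step + 1)}


n, lag = 24, 3
classes = blocks(n, lag)

# The product Y_{ji} = X_{t,1} X_{t+lag,2} uses the observations with time indices t and t + lag.
used = {j: [(t, t + lag) for t in ts] for j, ts in classes.items()}

for j in sorted(used):
    print(j, used[j])
```

Within a class, consecutive products use the index pairs $(t, t+l_n)$ and $(t+l_n+1, t+2l_n+1)$, which do not overlap, so Lemma 3 can be applied to each class separately before taking the union bound over the $l_n + 1$ classes.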
Lemmas 5, 6 and 7 below are based on the autoregressive model with order 1 under $\|A_1\|_2 \le
\delta < 1$. Similar techniques can be applied to the general case of order $d$. For $j, k = 1, \dots, p$,
define $\hat\Sigma_{jk} = n^{-1} \sum_{t=1}^{n} y_{j,t} y_{k,t}$ and $\Sigma_{jk} = E(\hat\Sigma_{jk})$. For $i = 1, \dots, p$, let $e_{(i)} = (\varepsilon_{i,2}, \dots, \varepsilon_{i,n})^{T}$ and
$x_{(i)} = (y_{i,1}, \dots, y_{i,n-1})^{T}$. We should note that Lemmas 5 and 6 have the same rate expressions but
the actual rates are different, since they are under Conditions 4(i) and 4(ii), respectively.


Lemma 5. Suppose that Conditions 1-3 and 4(i) in Section 3.1 of the original article hold.

(i) For $j, k = 1, \dots, p$, there exist positive constants $C_1$, $C_2$ and $C_3$ free of $(j, k, n, p)$ such that

$$\mathrm{pr}(|\hat\Sigma_{jk} - \Sigma_{jk}| > x) \le \frac{C_1 n}{(nx)^{q}} + C_2 \exp(-C_3 n x^2)$$

holds for $x > 0$; consequently, this leads to the following uniform convergence rate:

$$\sup_{1 \le j,k \le p} |\hat\Sigma_{jk} - \Sigma_{jk}| = O_p\{(n^{-1} \log p)^{1/2}\}.$$


(ii) For j, k — 1.... , p, there exist positive constants C \, C 2 and C : > free of (j, k, n, p) such that 

pr{\e T U)X{k) \ >x}<^- + C 2 ex p ( - Cyx 2 ) 
holds for $x > 0$; in particular, we have

$$\sup_{j,k \le p} |e_{(j)}^{T} x_{(k)}| = O_p\{(n \log p)^{1/2}\}.$$


Proof. Here we prove part (i) only. Part (ii) can be proved analogously. Let p q = sup ;<p ||£ j0 || g for 
q > 2. To use the results of Lemma 1, we just need to bound the physical dependent measure of 
ypiVkA for each j and k, denoted by 9i, q ,j, k = \\Vj,iVk,i ~ if .3)1 Aq wilh U),i being the coupled process 
of y hl . Denote the physical dependent measure of y } , % by 0 l 2q ] = \\y :) l — y* t \\ 2q with y* d being the 
coupled process of yj ;i . 

We will show (a) sup^ 11 2 /^,*11 2g < Cp 2q ; (b) sup j< p 0 it2q j < Cp 2q (i + 1)5*, where C is some 
positive constant and depends only on the spectral norm of A x rather than q. Observe that \\yj q y k y — 
VjyVkyWq — II Vj^Vky Vj^Ukyllq || yj,iPk,i yj,iyk,i\\q <ind hence 

II Vj,iVk,i - VliVlyWq < sup \\y j; i\\ 2q (0i M)j 4- 0i,2q,k)- 

j<P 

If both bounds (a) and (b) are obtained, then, 

OO OO 

©m,, = sup Y 0i,2q,j,k < C p 2 2q Y^ + 1)5* < C p? 2q (l - 5)~ 2 (m + l)5 m = o(m ~ a ) 

j,k<p . 

J ’ — L i=m %—m 

for any a > 1. Applying Lemma 1 we prove part (i). 
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Let us turn to bound sup J<p || 2<? - Let A[ be ( ai,jk)j,k< P with l > 1. Since A[ is a banded 
matrix with the bandwidth min(2 lk 0 + 1 ,p), we can bound H^Hoq by 

p 

Pilloo = maxy, \a l jk \ < {min(2 lk 0 + l,p)} 1/2 ||A z 1 || 2 < C(2lk 0 + 1 )S l ,l > 1, (A.10) 

' " r. 

which implies that H^Hoo < C(2k 0 + 1)(/ + 1 j S l .1 > 0. Using the innovation representation 
}’t = SLo A[e t -i, we get 

OO p CO p 

hi ,i || 2q *£n£ ^l,jk^k,i— 1 1| 2q ^ ££ l a Zjfc| i 11 2<3t- 

Z=0 k =1 1=0 k=l 

As a result, sup^ hiAUg < C(2k 0 + = c ( 2k o + 1)(1 - 3)~ 2 p 2q < oo. 

Similarly, we can bound sup J<p 0 h2q , 3 above by C(i + 1)5* with some positive constant C since we 
have a nice inequality 

p 

hj,i — 112g — II ^ ^ a i,jk(^k, 0 — £ fc,o) Ihqr £ 2/X2g 11 11 oo - 

k =1 

The proof is complete. □ 

Lemma 6. Suppose that Conditions 1-3 and 4(ii) in Section 3.1 of the original article hold. Then
we have

(i) $\sup_{1 \le j,k \le p} |\hat\Sigma_{jk} - \Sigma_{jk}| = O_p\{(n^{-1} \log p)^{1/2}\}$;

(ii) $\sup_{1 \le j,k \le p} |e_{(j)}^{T} x_{(k)}| = O_p\{(n \log p)^{1/2}\}$.

Proof. Here we prove part (i) only. The proof of part (ii) can be derived similarly. 

Note that y t = + e t -i and ||Ai|| 2 < 6 < 1. Let A\ be ( aijk)j,k< P ■ For each j, y ht = 

Ya=o EL 1 converges almost surely. Write p jA = EL=i a i,jm£m,t-i for l > 0. We 

divide y ht into two terms y :l i = E/Lo hjp + ELjv„+i Vj,it- Here we choose N n to be N$ log(n) with 
Ng > (1 + a )a -1 (— log5) _1 . Hence, rTC ]k can be expressed as 


” \ ~ f " \ 

n ^ik = 2_, l 2^ 4j,itVk,i't I + 2^ LL Vj,uVk,i't 

1,1'= 0 \t= 1 J l,l'=N n +l \t= 1 / 

Nn oo / n \ oo JV n / n \ 

+ £ £ £ VjpVkpt I + £ £ £ Vj,itVk,i't j 

Z=0 Z'=Af n +l \t=l / Z=7V n +l Z'=o \t=l / 

— + <Sjfc,2 + Sjk, 3 + <Sjfe,4, 

and n= El=i { S jk,m~ E(S jkjm )}. Let us handle the first term S jkt i - E(S jkjl ). Note 
that if sup m ,£ , {exp(|f 0 £m,z| 2a )} < oo, 

Ce = sup E{ exp (toLm,z£m',z'| Q )} < oo. 

171,1,771',l' 
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By Lemma 4, we obtain the following equality, 


n 

f |^| ^ ^ E{s m g—Z^mgt—Z') } 3(/ n 4" l)x 


< (l n + 1 )Ci exp 


+ C\{l n + 1 ) exp ( — —— - 5 - 


V C 2 n + Csn 2 ~ a xJ V C 2 n 2 -“ + ) 

+Ci(l n + l)nexp (—C 4 a: a ) 

for some positive constants Cj(j = 1,... ,4), where l n = |/ — /'|. Taking a; = C(n log/)) 1 / 2 for 
some large constant C > 0, we derive the following term 


^ In SUp (/ T 1) (l T 1) y ^ E{£ m i— Z') } 

m,m r <p,l,V <N n ^ 

with the convergence rate Op{(n log//) 1 / 2 }. Observe that 

71 

| ^ ^ 5/ C(2ko + 1) 2 (/ + 1) 3 (^ + l) 3 5 i+i ' q n , (A.11) 

t =i 

and E/= 0 (Z + 1) 3 / / < oo. Therefore, 

sup S jktl -~E(S jktl ) <C(2k 0 + l) 2 rj n = O P {(n\ogp) 1/2 }. 

j,k<p 


Consider the second term. Since sup m ^{ exp (|f 0 e m /| 2 “)} < oo, ( q ,e = sup m ,i,m',i' £m,i£m',i 
Cq 1 /“ for any q > 2. Now we bound S ]k2 — ESj k _ 2 . To be specific, 


v < 


n p 

Y, {Vj,ltVk,l't ^ E(Vj,ltVk,l't )) < U yy K,jm|| UZ,fcm/ 1 Slip £m,Z£m,Z' 

. . <7 . . m,m',l,l' 


m,m '=1 


< n(2/c 0 + !)"(/ + 1)(/ / + Cg,£- 


(A. 12) 


Hence, 


S jkt2 ~ E(S jkt2 ) < Cnq l/a Y (l + l)(l'+ l)8 l+l ’ <C-nNy^q 1 ^ 

a zJ 


l,l'=N n +l 


Write rj n 2 = ( nN^8 2Nn ) ' jS'j/.p — To (.Sjp^) } - It follows from Lemma 2 that there exists a constant 
A > 0 such that E[ exp(A| z/ n2 | a )} < oo. Consequently, for a large constant C > 0 , we have that 


prj sup S jk ,2 - E{S jki2 ) >C(logn) : 

^ j)k<p 


< 0{l)p 2 exp { - tC a • n(log n)~ 2a (log n) 2a j ->■ 0, 
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as n —> oo, which implies that sup.,- k<p \s jK2 - E(S jkt2 )\ = Op | (logn) 2 1 = o P { (n logp) 1/2 }. 


Similarly, we can prove that $\sup_{j,k \le p} |S_{jk,m} - E(S_{jk,m})| = o_p\{(n \log p)^{1/2}\}$, $m = 3, 4$.
Finally, putting together the convergence rate results for the four terms, we conclude

$$\sup_{j,k \le p} |\hat\Sigma_{jk} - \Sigma_{jk}| = O_p\{(n^{-1} \log p)^{1/2}\}.$$

The proof is complete. □

Lemma 7. Suppose that Conditions 1-3 and 4(i) or 4(ii) in Section 3.1 of the original article
hold. Then, for each finite $k$ with $k \ge k_0$,

$$\sup_{1 \le i \le p} \Big|\frac{\mathrm{RSS}_i(k)}{n\sigma_i^2} - 1\Big| = O_p\{(n^{-1} \log p)^{1/2}\}$$

as $n \to \infty$, where $\mathrm{RSS}_i(k)$ is defined in (2.6) and $\sigma_i^2$ is the $(i,i)$-th element of $\Sigma_\varepsilon$.

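Proof. For $k \ge k_0$, the term $\mathrm{RSS}_i(k)$ can be decomposed as

$$\mathrm{RSS}_i(k) = e_{(i)}^{T} e_{(i)} - e_{(i)}^{T} X_i (X_i^{T} X_i)^{-1} X_i^{T} e_{(i)} = R_{i1} - R_{i2},$$

where $e_{(i)} = (\varepsilon_{i,2}, \dots, \varepsilon_{i,n})^{T}$, and $X_i$ is an $(n-1) \times \tau_i(k)$ matrix with $x_{i,j+1}^{T}$ as its $j$-th row. We will
show below that, under Conditions 1-3 and 4(i) or 4(ii),

(a) $\sup_{i \le p} |R_{i1} - n\sigma_i^2| = O_p\{(n \log p)^{1/2}\}$; (b) $\sup_{i \le p} |R_{i2}| = O_p(\log p)$.

With results in (a) and (b), it follows that

$$\sup_{i \le p} \Big|\frac{\mathrm{RSS}_i(k)}{n\sigma_i^2} - 1\Big| \le \sup_{i \le p} \Big|\frac{R_{i1}}{n\sigma_i^2} - 1\Big| + \sup_{i \le p} \Big|\frac{R_{i2}}{n\sigma_i^2}\Big| = O_p\{(n^{-1} \log p)^{1/2}\}.$$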

Suppose first that Condition 4(i) holds. Consider the term $R_{i1} - n\sigma_i^2$. Lemma 1 shows that

$$\sup_{i \le p} |e_{(i)}^{T} e_{(i)} - n\sigma_i^2| = O_p\{(n \log p)^{1/2}\}.$$

Let us handle the term $\sup_{i \le p} |R_{i2}|$. Define

$$\mathcal{A}_n = \Big\{\inf_{i \le p} \lambda_{\min}(n^{-1} X_i^{T} X_i) \ge \kappa_1 (1 - \eta)\Big\}$$

with $0 < \eta < 1$. It follows from Lemma 5(i) and Condition 3 that $\mathrm{pr}(\mathcal{A}_n) \to 1$ as $n \to \infty$. On the
event $\mathcal{A}_n$, the term $\sup_{i \le p} |R_{i2}|$ can be bounded above by $\{n\kappa_1(1 - \eta)\}^{-1} k_0 \sup_{j,k \le p} |e_{(j)}^{T} x_{(k)}|^2$. By
Lemma 5 (ii), we obtain that $\sup_{j,k \le p} |e_{(j)}^{T} x_{(k)}| = O_p\{(n \log p)^{1/2}\}$, which implies that (b) holds.

Suppose that Condition 4(ii) holds. Consider the term $R_{i1} - n\sigma_i^2$. By Lemma 3 and taking
$x = C(n \log p)^{1/2}$ with a large constant $C > 0$, we have that

$$\sup_{i \le p} |e_{(i)}^{T} e_{(i)} - n\sigma_i^2| = O_p\{(n \log p)^{1/2}\}.$$

Similarly, we can establish (b) from Lemma 6. The proof is complete. □
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A.8 An additional Simulation Study 

We conduct an additional Monte Carlo experiment to examine the proposed methodology. We consider a banded vector autoregressive model with sample size $n + 2$, where the last two observations are used to calculate the one-step-ahead and two-step-ahead post-sample prediction errors.

The data were generated from a vector autoregressive model with $d = 1$ and the banded coefficient matrix $A$ specified in scenario (1) in the paper. We set $n = 200$, $p = 100$ or $200$, and $k_0 = 2$, and each setting was repeated 100 times. To mimic real applications in which the true ordering is unknown, we considered three other orderings obtained by random permutation. The first was generated through local permutation: we partitioned the components of $y_t$ into $[p/5]$ groups of 5 components each and then performed a random permutation within each group. The other two were generated by permuting all the components of $y_t$ together. Also included in the comparison is the sparse autoregressive model determined by the lasso. Table 7 below reports the Bayesian information criterion scores, estimated bandwidths and prediction errors. It indicates that the model with the true ordering offers the best post-sample prediction, followed by the model with the local permutation only, and then the lasso-based model, while the two models with arbitrary permutations perform the worst.
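
The design of this experiment can be sketched in code as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the coefficient matrix of scenario (1) is not reproduced in this section, so `banded_coefficients` draws hypothetical band entries and rescales them for stability; the prediction error is taken to be the mean squared error over components, which is also an assumption; the bandwidth is fixed at $k_0$ instead of being selected by the information criterion; and all function names are made up for the example.

```python
import numpy as np

def banded_coefficients(p, k0, rng, radius=0.8):
    """Hypothetical banded matrix: entries with |i - j| > k0 are zero."""
    A = rng.uniform(-1.0, 1.0, size=(p, p))
    i, j = np.indices((p, p))
    A[np.abs(i - j) > k0] = 0.0
    # Rescale so the spectral radius equals `radius` (< 1): a stable VAR(1).
    A *= radius / np.max(np.abs(np.linalg.eigvals(A)))
    return A

def simulate_var1(A, n_obs, rng, burn_in=100):
    """Simulate y_t = A y_{t-1} + eps_t with standard normal innovations."""
    p = A.shape[0]
    y = np.zeros(p)
    out = np.empty((n_obs, p))
    for t in range(burn_in + n_obs):
        y = A @ y + rng.standard_normal(p)
        if t >= burn_in:
            out[t - burn_in] = y
    return out

def local_permutation(p, group_size, rng):
    """Permute indices at random within consecutive groups of `group_size`."""
    perm = np.arange(p)
    for start in range(0, p, group_size):
        block = perm[start:start + group_size].copy()
        perm[start:start + group_size] = rng.permutation(block)
    return perm

def fit_banded_var1(Y, k):
    """Row-wise least squares using only lagged components with |i - j| <= k."""
    n_obs, p = Y.shape
    A_hat = np.zeros((p, p))
    rss = np.empty(p)
    lagged, current = Y[:-1], Y[1:]
    for i in range(p):
        cols = np.arange(max(0, i - k), min(p, i + k + 1))
        coef, *_ = np.linalg.lstsq(lagged[:, cols], current[:, i], rcond=None)
        A_hat[i, cols] = coef
        resid = current[:, i] - lagged[:, cols] @ coef
        rss[i] = resid @ resid
    return A_hat, rss

rng = np.random.default_rng(1)
n, p, k0 = 200, 100, 2
A = banded_coefficients(p, k0, rng)
Y = simulate_var1(A, n + 2, rng)      # last two rows are held out
train, test = Y[:n], Y[n:]
perm = local_permutation(p, 5, rng)

for name, order in [("true ordering", np.arange(p)), ("local permutation", perm)]:
    A_hat, rss = fit_banded_var1(train[:, order], k0)
    one_step = A_hat @ train[-1, order]
    err = np.mean((test[0, order] - one_step) ** 2)
    print(name, round(float(err), 3))
```

A global permutation can be obtained in the same way by passing `rng.permutation(p)` as the ordering. A single replication like this one is noisy, so a comparison in the spirit of Table 7 would average the errors over many replications.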
Table 7: Average Bayesian information criterion values, estimated bandwidth parameters, and one-step-ahead and two-step-ahead post-sample prediction errors over 100 replications, with their corresponding standard errors in parentheses.

Ordering              BIC          Bandwidth      One-step ahead   Two-step ahead

Case 1: n = 200, p = 100
True ordering         546 (3.5)    1.78 (0.52)    0.787 (0.06)     0.837 (0.07)
Local permutation     549 (5.6)    2.14 (0.83)    0.788 (0.06)     0.837 (0.07)
Random permutation    546 (11)     0.71 (0.90)    0.828 (0.07)     0.848 (0.08)
Random permutation    547 (13)     0.70 (1.06)    0.827 (0.06)     0.848 (0.08)
Lasso                 -            -              0.823 (0.06)     0.846 (0.08)

Case 2: n = 200, p = 200
True ordering         1093 (4.8)   1.87 (0.36)    0.786 (0.04)     0.830 (0.05)
Local permutation     1102 (10)    2.39 (0.75)    0.787 (0.04)     0.829 (0.05)
Random permutation    1098 (22)    0.89 (0.80)    0.829 (0.05)     0.839 (0.05)
Random permutation    1096 (20)    0.78 (0.70)    0.831 (0.05)     0.838 (0.05)
Lasso                 -            -              0.827 (0.05)     0.838 (0.05)